arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10009 2026-05-12 cs.CV

Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation

Yujia Cai, Boxuan Li, Chenghao Xu, Jiexi Yan

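AI summary This paper proposes Hystar, a lightweight framework that addresses the distribution shift caused by diverse query styles in query-based image retrieval (QBIR). A hypernetwork dynamically generates singular-value perturbations for the attention layers, adapting the model to each query's style, while static singular-value offsets keep behavior stable across styles. Hystar also introduces StyleNCE, an optimal-transport-based contrastive loss that strengthens cross-style semantic discrimination. Experiments show that the method outperforms existing approaches on multi-style retrieval and cross-style classification while remaining parameter-efficient and stable across styles.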

Comments Accepted by ICLR 2026

Abstract

Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
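
The singular-value modulation described in this abstract can be illustrated with a toy sketch. It assumes a weight factored as $W = U \,\mathrm{diag}(S)\, V^\top$ with fixed orthogonal factors and rebuilds it with $S + \Delta S$. The 2x2 rotation factors, the linear `toy_hypernetwork`, and the matrix `PROJ` are hypothetical stand-ins chosen for illustration; they are not taken from the paper.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, used here as a toy orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def compose(u, s, v):
    """Rebuild W = U diag(s) V^T from fixed factors and singular values s."""
    diag = [[s[i] if i == j else 0.0 for j in range(len(s))]
            for i in range(len(s))]
    return matmul(matmul(u, diag), transpose(v))

def toy_hypernetwork(style_embedding, proj):
    """Hypothetical hypernetwork: a linear map from style embedding to delta S."""
    return [sum(p * e for p, e in zip(row, style_embedding)) for row in proj]

def modulated_weight(u, s, v, style_embedding, proj):
    """Shift the singular values by the per-input delta S and rebuild W."""
    delta_s = toy_hypernetwork(style_embedding, proj)
    return compose(u, [si + di for si, di in zip(s, delta_s)], v)

U = rotation(0.3)                  # assumed left singular vectors
V = rotation(-1.1)                 # assumed right singular vectors
S = [2.0, 0.5]                     # base singular values
PROJ = [[0.1, 0.0], [0.0, -0.2]]   # hypothetical hypernetwork weights

W_base = compose(U, S, V)
W_adapted = modulated_weight(U, S, V, [1.0, 1.0], PROJ)
```

Only the two singular values change per input in this sketch, while `U` and `V` stay fixed, which is one way to read the parameter-efficiency claim. The real hypernetwork, its inputs, and the layers it targets are specified in the paper rather than here.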

2605.10002 2026-05-12 cs.CV

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Minh Khoi Nguyen, Dai Lam Le, Amir Reza Jafari, Tuan Dung Nguyen, Mai Hong Son, Mai Huy Thong, Quang Huy Nguyen, Thanh Trung Nguyen, Reza Farahbakhsh, Noel Crespi, Phi Le Nguyen

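AI summary This work introduces Med-StepBench, the first large-scale benchmark for evaluating the step-wise reasoning of medical vision-language models on 3D PET/CT imaging, aimed at detecting hallucinations in which models produce clinically plausible but incorrect diagnoses. The framework decomposes clinical reasoning into four diagnostic stages and, with more than 12,000 images and 1,000,000 image-statement pairs, exposes systematic weaknesses of existing models in multi-step reasoning. The study also shows that current models are highly sensitive to plausible but misleading intermediate explanations, which further amplifies hallucination risk, and it provides a basis for building safer and more reliable medical vision-language models.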

Comments Accepted at IJCAI-ECAI 2026

Abstract

Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.

2605.10001 2026-05-12 cs.LG

Anchor-guided Hypergraph Condensation with Dual-level Discrimination

Fan Li, Xiaoyang Wang, Chen Chen, Wenjie Zhang

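AI summary As hypergraphs grow in scale, training hypergraph neural networks faces significant computational challenges. To address this, the paper proposes AHGCDD, a hypergraph condensation method that introduces an anchor-guided hyperedge synthesis strategy and a dual-level discrimination objective to jointly optimize structure and features, improving condensation efficiency and downstream task performance. The method couples structure generation and feature condensation more tightly, avoiding the structure-feature misalignment of earlier methods. Experiments show that AHGCDD achieves superior condensation quality and computational efficiency on multiple benchmark datasets.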

Comments This paper has been accepted by ICML 2026

Abstract

The increasing prevalence of large-scale hypergraphs poses significant computational challenges for hypergraph neural network (HNN) training. To address this, hypergraph condensation (HGC) distills large real hypergraphs into compact yet informative synthetic ones, beyond graph condensation (GC) methods limited to pairwise relations. However, existing HGC methods rely on decoupled training architectures, where structure generators are pre-trained on the original hypergraph but not jointly optimized with condensed features during refinement, resulting in misaligned structures that degrade downstream utility. Moreover, trajectory-based optimization incurs substantial computational overhead in refinement, limiting condensation efficiency. To tackle these issues, we propose Anchor-guided HyperGraph Condensation with Dual-level Discrimination (AHGCDD), which consists of three key components: (1) a node initialization module based on Heat Kernel PageRank (HKPR) to encode structural knowledge into feature semantics; (2) an anchor-guided hyperedge synthesis strategy for joint optimization of condensed features and structure; (3) a theoretically grounded dual-level discrimination objective for utility-preserving condensation without redundant HNN training. Extensive experiments demonstrate the superior effectiveness and efficiency of AHGCDD.

2605.09999 2026-05-12 cs.RO cs.PF cs.SY eess.SY

Muninn: Your Trajectory Diffusion Model But Faster

Gokul Puthumanaillam, Hao Jiang, Ruben Hernandez, Jose Fuentes, Paulo Padrao, Leonardo Bobadilla, Melkior Ornik

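AI summary This paper proposes Muninn, a training-free caching method that accelerates diffusion-based trajectory planners so that they are suitable for real-time robotics. Its core idea is to use a signal of how the diffusion model's internal trajectory representation changes, together with analytic coefficients of the denoising error, to decide dynamically whether to reuse a cached denoiser output, thereby reducing unnecessary computation. Experiments show that Muninn achieves speedups of up to 4.6x across several trajectory diffusion models while preserving task performance and safety, and its effectiveness is validated in real hardware deployments.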

Comments Accepted to Robotics: Science and Systems 2026

Abstract

Diffusion-based trajectory planners can synthesize rich, multimodal robot motions, but their iterative denoising makes online planning and control prohibitively slow. Existing accelerations either modify the sampler or compress the network--sacrificing plan quality or requiring retraining without accounting for downstream control risk. We address the problem of making diffusion-based trajectory planners fast enough for real-time robot use without retraining the model or sacrificing trajectory quality, and in a way that works across diverse state-space diffusion architectures. Our key insight is that diffusion trajectory planners expose two signals we can exploit: a cheap probe of how their internal trajectory representation changes across steps, and analytic coefficients that describe how denoiser errors affect the sampler's state update. By calibrating the first signal against the second on offline runs, we obtain a per-step score that upper-bounds how far the final trajectory can deviate when we reuse a cached denoiser output, and we treat this bound as an uncertainty budget that we can spend over the denoising process. Building on this insight, we present Muninn, a training-free caching wrapper that tracks this uncertainty budget during sampling and, at each diffusion step, chooses between reusing a cached denoiser output when the predicted deviation is small and recomputing the denoiser when it is not. Across standard benchmarks Muninn delivers up to 4.6x wall-clock speedups across several trajectory diffusion models by reducing denoiser evaluations, while preserving task performance and safety metrics. Muninn further certifies that cached rollouts remain within a specified distance of their full-compute counterparts, and we validate these gains in real-time closed-loop navigation and manipulation hardware deployments. Project page: https://github.com/gokulp01/Muninn.
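
The reuse-or-recompute decision described in this abstract can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Muninn's implementation: `cached_sampling`, `probe_score`, the toy denoiser, and the rule of charging each reuse against one global budget are all hypothetical, and the calibration of scores against the sampler's analytic coefficients is not modeled.

```python
def cached_sampling(denoiser, probe_score, update, x, num_steps, budget):
    """Run a toy denoising loop, reusing the cached denoiser output while
    the accumulated predicted deviation stays within `budget`."""
    cached = None
    spent = 0.0
    evaluations = 0
    for step in range(num_steps):
        score = probe_score(x, step)   # cheap estimate of the cost of reusing
        if cached is not None and spent + score <= budget:
            spent += score             # spend part of the uncertainty budget
            eps = cached
        else:
            eps = denoiser(x, step)    # full (expensive) evaluation
            cached = eps
            evaluations += 1
        x = update(x, eps, step)
    return x, evaluations, spent

def toy_denoiser(x, step):
    """Hypothetical denoiser: predicts a fraction of the current state."""
    return 0.1 * x

def toy_probe(x, step):
    """Hypothetical probe: a constant predicted deviation per reused step."""
    return 0.25

def toy_update(x, eps, step):
    """Hypothetical sampler update: subtract the predicted noise."""
    return x - eps

_, EVALS, SPENT = cached_sampling(toy_denoiser, toy_probe, toy_update,
                                  x=1.0, num_steps=10, budget=1.0)
```

In this toy run the budget covers four reused steps, so the denoiser is evaluated six times instead of ten. Setting `budget=0.0` recovers the full-compute loop.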

2605.09998 2026-05-12 cs.LG cs.AI

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Seth Karten, Joel Zhang, Tersoo Upaa, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli

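AI summary This paper studies online adaptation for embodied agents in long-horizon, partially observable decision-making and proposes Continual Harness, which lets an agent improve itself continually without human intervention by iterating on its own strategy and refining its long-term memory. Starting from a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory. In Pokemon games this substantially reduces button-press cost and recovers most of the gap to a hand-engineered expert harness. The work also builds an online process-reward co-learning loop in which the model itself participates, driving sustained in-game milestone progress.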

Comments 28 pages, 19 figures, 5 tables

Abstract

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon partial-observability decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human fully from this loop: a reset-free self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.

2605.09996 2026-05-12 cs.CV

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Yeongtak Oh, Dongwook Lee, Sangkwon Park, Heeseung Kim, Sungroh Yoon

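AI summary This paper presents Omni-Persona, the first comprehensive benchmark for omnimodal personalization, built to systematically evaluate and improve joint personalization across text, image, and audio. The benchmark formalizes the task through a Persona Modality Graph covering 4 task groups and 18 fine-grained tasks, and introduces Calibrated Accuracy (Cal), a metric that jointly measures correct grounding and appropriate abstention. Experiments reveal a gap between audio and visual grounding in open-source models, show that parameter scale and recall are not reliable diagnostics on their own, and highlight the different limitations of supervised fine-tuning and reward-based reinforcement learning for personalization.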

Comments Project Page: https://github.com/oyt9306/Omni-Persona

Abstract

While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacks the methodological rigor to account for absent-persona scenarios or to support systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across approximately 750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy ($\mathrm{Cal}$), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.

2605.09995 2026-05-12 cs.CL

Annotations Mitigate Post-Training Mode Collapse

Jacob Mitchell Springer, Madhu Advani, Lukas Aichberger, Arwen Bradley, Eran Malach, Omid Saremi, Sinead Williamson, Preetum Nakkiran, Etai Littwin, Aditi Raghunathan

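AI summary This study examines how supervised fine-tuning (SFT), while improving instruction following, can cause semantic mode collapse, and finds that the problem worsens as models scale. The authors propose annotation-anchored training: pretrain on documents paired with semantic annotations, then preserve the resulting annotation distribution and its diversity during fine-tuning, so that the model retains rich semantic variety afterward. Experiments show that the method mitigates the loss of semantic diversity and that its benefit increases with model scale.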

Comments 21 pages, 8 figures, 11 tables. Accepted at ICML 2026

Abstract

Post-training (via supervised fine-tuning) improves instruction-following, but often induces semantic mode collapse by biasing models toward low-entropy fine-tuning data at the expense of the high-entropy pretraining distribution. Crucially, we find this trade-off worsens with scale. To close this semantic diversity gap, we propose annotation-anchored training, a principled method that enables models to adopt the preference-following behaviors of post-training without sacrificing the inherent diversity of pretraining. Our approach is simple: we pretrain on documents paired with semantic annotations, inducing a rich annotation distribution that reflects the full breadth of pretraining data, and we preserve this distribution during post-training. This lets us sample diverse annotations at inference time and use them as anchors to guide generation, effectively transferring pretraining's semantic richness into post-trained models. We find that models trained with annotation-anchored training can attain $6 \times$ less diversity collapse than models trained with SFT, and improve with scale.

2605.09993 2026-05-12 cs.LG

Learning Graph Foundation Models on Riemannian Graph-of-Graphs

Haokun Liu, Zezhong Ding, Xike Xie

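AI summary This paper proposes R-GFM, a graph foundation model built on a Riemannian Graph-of-Graphs (GoG) structure, to address the limited generalization of existing graph foundation models on tasks with differing scales and structural complexity. R-GFM builds a multi-scale GoG over subgraphs sampled at different hop distances and learns geometry-adaptive representations on Riemannian manifolds, capturing the structure of graph data more flexibly. Experiments show state-of-the-art performance on multiple datasets, with relative improvements of up to 49% on some tasks.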

Comments This paper has been accepted by ICML 2026

Abstract

Graph foundation models (GFMs), pretrained on massive graph data, have transformed graph machine learning by supporting general-purpose reasoning across diverse graph tasks and domains. Existing GFMs pretrained with fixed-hop subgraph sampling impose a fixed receptive field, causing scale mismatch on diverse tasks, which often require heterogeneous and unknown structural contexts beyond a fixed sampling scale. We propose R-GFM, a Riemannian Graph-of-Graphs (GoG) based foundation model that treats structural scale as a first-class citizen in modeling. R-GFM constructs a multi-scale GoG over subgraphs sampled at different hop distances and learns geometry-adaptive representations from Riemannian manifolds. Theoretical analysis shows that R-GFM reduces structural domain generalization error compared to fixed-scale GFMs. Experiments on various datasets demonstrate that R-GFM achieves state-of-the-art performance, with up to a 49% relative improvement on downstream tasks. Our code is available at https://github.com/USTC-DataDarknessLab/R-GFM.
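
The contrast between fixed-hop and multi-hop sampling in this abstract can be made concrete with a breadth-first sketch. It covers only the step of sampling subgraphs at several hop distances around one node. The function names and the toy path graph are hypothetical, and neither the Graph-of-Graphs construction nor the Riemannian representation learning is modeled.

```python
from collections import deque

def k_hop_nodes(adj, root, k):
    """Return the set of nodes within k hops of root (breadth-first search)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue                   # do not expand beyond k hops
        for neighbor in adj[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)

def multi_scale_subgraphs(adj, root, hops):
    """Sample one node set per hop distance, giving several structural scales."""
    return {k: k_hop_nodes(adj, root, k) for k in hops}

# Toy path graph 0 - 1 - 2 - 3 - 4 as an adjacency list.
ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
SCALES = multi_scale_subgraphs(ADJ, root=2, hops=(1, 2))
```

A fixed-hop sampler would keep only one entry of `SCALES`. Keeping several gives a downstream model access to more than one receptive field for the same node.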

2605.09992 2026-05-12 cs.LG cs.AI

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

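AI summary This paper studies how attention shifts during generation in autoregressive speculative decoding drafters, a phenomenon the authors call "attention drift": as the drafter generates successive tokens, its attention gradually moves from the original prompt to its own generated content. The authors trace the phenomenon to an un-normalized residual path inside the model, which makes the hidden state grow with generation depth. They propose two architectural changes, post-norm on the hidden states and per-hidden-state RMS normalization, which improve acceptance length and generalization under template perturbation, on long-context tasks, and on several benchmarks.

Abstract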

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently generated tokens. We observe this across both EAGLE3 drafters and MTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, and the drafter exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than with a standalone autoregressive predictor. To limit this growth, we propose two architectural changes: post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to $2\times$ under template perturbation, $1.18\times$ on long-context tasks, and $1.10\times$ on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.
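
The magnitude growth along an un-normalized residual path, and the effect of normalizing each hidden state, can be seen in a toy sketch. It does not reproduce a drafter architecture: the residual update in `toy_chain` is a hypothetical stand-in, and `rms_norm` is RMSNorm without a learned gain.

```python
import math

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: rescale x to roughly unit RMS."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def toy_chain(h, steps, normalize):
    """Toy residual chain; returns the hidden-state RMS after each step."""
    magnitudes = []
    for _ in range(steps):
        h = [v + 0.5 * v + 0.1 for v in h]   # hypothetical residual update
        if normalize:
            h = rms_norm(h)                  # normalize each hidden state
        magnitudes.append(rms(h))
    return magnitudes

H0 = [1.0, -2.0, 0.5, 3.0]
RAW = toy_chain(H0, steps=5, normalize=False)     # grows at every step
NORMED = toy_chain(H0, steps=5, normalize=True)   # stays near 1.0
```

Without normalization the magnitude increases with chain depth, which mirrors the growth the abstract describes. With per-state normalization it stays bounded regardless of depth.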

2605.09991 2026-05-12 cs.AI cs.LG math.OC

Optimizer-Induced Mode Connectivity: From AdamW to Muon

Fangzhao Zhang, Sungyoon Kim, Erica Zhang, Yiqi Jiang, Mert Pilanci

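AI summary This paper studies how the optimizer affects mode connectivity, asking how connectivity behaves within the set of solutions constrained by a given optimizer. Analyzing two-layer ReLU networks, it shows that at sufficiently large width the solutions produced by a single optimizer (such as AdamW or Muon) form a connected set, a result not implied by prior work. The analysis further shows that the solution regions of different optimizers can be disjoint or overlapping depending on regularization. Empirically, in GPT-2 pretraining, same-optimizer paths preserve each model's spectrum while cross-optimizer paths show a smooth transition, revealing the optimizer's influence on the structure of the solution space.

Abstract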

Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.
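
The idea of a loss barrier between zero-loss solutions can be illustrated on a toy problem. This sketch uses a scalar two-layer linear model and a straight-line path, both of which are assumptions made for illustration. The paper studies two-layer ReLU networks, and its paths and barrier definition may differ.

```python
def toy_loss(w):
    """Loss of a scalar two-layer linear model fitting the target product 1."""
    return (w[0] * w[1] - 1.0) ** 2

def linear_path_barrier(loss, a, b, num_points=11):
    """Highest loss on the segment from a to b, minus the worse endpoint loss."""
    worst = 0.0
    for i in range(num_points):
        t = i / (num_points - 1)
        w = [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]
        worst = max(worst, loss(w))
    return worst - max(loss(a), loss(b))

# (1, 1) and (-1, -1) both reach zero loss but lie on different branches
# of the zero-loss set w0 * w1 = 1.
BARRIER_OPPOSITE = linear_path_barrier(toy_loss, [1.0, 1.0], [-1.0, -1.0])
BARRIER_SAME = linear_path_barrier(toy_loss, [1.0, 1.0], [1.0, 1.0])
```

The straight path between the two solutions passes through the origin, where the loss is 1, so the measured barrier is 1. A zero barrier along some path is what connectivity requires, and a linear path is only the simplest candidate.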

2605.09990 2026-05-12 cs.CL

Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference

Sietse Schelpe

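AI summary This paper presents Merlin, a deterministic byte-exact deduplication system aimed at the efficiency bottleneck that redundant text creates in large language model inference. The system uses an optimized, SIMD-friendly hash set with xxHash3-64 to deduplicate text and optimize context efficiently and exactly, and is particularly suited to scenarios such as retrieval-augmented generation (RAG). Experiments show substantial input reduction on datasets with varying levels of redundancy while preserving data integrity, and the system supports fast, secure deployment through the Model Context Protocol (MCP).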

Comments Preprint. Implementation and open-source community version available at: https://github.com/corbenicai/merlin-community - https://doi.org/10.5281/zenodo.20090991

Abstract

Data-intensive applications, ranging from large-scale retrieval systems to advanced data pipelines, are increasingly bottlenecked by the processing of highly redundant text corpora. We present Merlin, a local-first, agnostic, high-throughput deduplication and context optimization engine designed to mitigate these inefficiencies. Utilizing a highly optimized, SIMD-friendly open-addressing flat hash set combined with xxHash3-64, Merlin performs rapid, byte-exact deduplication of text passages and data chunks. While broadly applicable to any text-processing workflow, its impact is particularly pronounced in Large Language Model (LLM) ecosystems, such as Retrieval-Augmented Generation (RAG). Our empirical evaluations demonstrate an input reduction ranging from 13.9% in low-redundancy datasets to over 71% in high-redundancy pipelines, maintaining absolute data fidelity. Furthermore, we detail the system's integration architecture via the Model Context Protocol (MCP), enabling secure, zero-network-interception deployment across major IDEs and autonomous agents. This paper outlines the core algorithmic design, performance benchmarks, and the architectural principles required to process data at sustained speeds of up to 8.7 GB/s.
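
Byte-exact deduplication of text chunks, as described in this abstract, can be sketched with the Python standard library. The sketch makes several substitutions that are assumptions rather than Merlin's design: a Python dict replaces the SIMD-friendly open-addressing flat hash set, an 8-byte BLAKE2b digest stands in for xxHash3-64 (which is not in the standard library), and the byte comparison after a hash match is added here to keep the result exact under collisions.

```python
import hashlib

def dedupe_exact(chunks):
    """Keep the first occurrence of each byte-identical chunk, in order."""
    seen = {}    # 64-bit digest -> chunks already kept with that digest
    kept = []
    for chunk in chunks:
        # Stand-in for xxHash3-64: an 8-byte BLAKE2b digest.
        digest = hashlib.blake2b(chunk, digest_size=8).digest()
        bucket = seen.setdefault(digest, [])
        if chunk in bucket:       # compare bytes so a collision drops nothing
            continue
        bucket.append(chunk)
        kept.append(chunk)
    return kept

def reduction_ratio(chunks, kept):
    """Fraction of input bytes removed by deduplication."""
    total = sum(len(c) for c in chunks)
    return 1.0 - sum(len(c) for c in kept) / total if total else 0.0

PASSAGES = [b"alpha", b"beta", b"alpha", b"Alpha", b"beta"]
UNIQUE = dedupe_exact(PASSAGES)
RATIO = reduction_ratio(PASSAGES, UNIQUE)
```

Because matching is byte-exact, `b"Alpha"` and `b"alpha"` are kept as distinct chunks. Near-duplicate detection would be a different technique.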

2605.09985 2026-05-12 cs.AI cs.LG cs.NE

Prospective Compression in Human Abstraction Learning

Leonardo Hernandez Cano, Ivan Zareski, Luisa El Amouri, Pinzhe Zhao, Max Mascini, Emanuele Sansone, Yewen Pu, Bonan Zhao, Marta Kryven

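AI summary This paper studies how humans incrementally learn and build reusable abstractions in non-stationary task environments. The authors propose that, unlike existing algorithms that compress retrospectively over past tasks, humans tend to compress prospectively, targeting future tasks. Through experiments with a visual program synthesis task and comparisons against computational models, the study finds that human abstraction behavior is sensitive to latent non-stationary structure in the task-generating process, which cannot be explained by conventional retrospective compression algorithms or by the inductive biases of LLM-based program synthesis.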

Comments Under review at NeurIPS 2026

Abstract

A core challenge in program synthesis is online library learning: the incremental acquisition of reusable abstractions under uncertainty about future task demands. Existing algorithms treat library learning as retrospective compression over a static task distribution, where the learned library is determined by the corpus of past tasks. However, real-world learning domains are often non-stationary, with tasks arising from a generative process that evolves over time. We propose and test the hypothesis that in non-stationary domains human library learning selects abstractions prospectively: targeting compression of future tasks. We study this question using the Pattern Builder Task, a visual program synthesis paradigm in which participants construct increasingly complex geometric patterns from a small set of primitives, transformations, and custom helpers that carry forward across trials. Using this task, we conduct two experiments with complementary latent curricula, designed to dissociate between behaviors consistent with prospective compression, and alternative library learning accounts. Using six computational models spanning online library learning strategies, we show that human abstraction behavior reflects sensitivity to latent, non-stationary structure in the task-generating process. This behavior is consistent with prospective compression, and cannot be captured by existing retrospective compression-based algorithms, or inductive biases modeled by LLM-based program synthesis.

2605.09984 2026-05-12 cs.CV cs.AI cs.LG

Geometric 4D Stitching for Grounded 4D Generation

Sunwoo Park, Taesung Kwon, Jong Chul Ye

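AI summary This paper proposes Geometric 4D Stitching, an efficient framework that addresses the geometric inconsistency and high reconstruction cost of existing 4D scene generation methods. The method explicitly identifies missing geometric regions and fills them with geometrically grounded 4D stitches, improving geometric consistency while making 4D scene generation considerably more efficient. It also supports iterative expansion of 4D meshes and scene editing.

Abstract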

Recent 4D generation methods complete scene-level missing information using generative models and reconstruct the scene into radiance-based representations. However, these pipelines often present geometric inconsistencies in the generated content, and the radiance-based reconstruction requires expensive optimization. Furthermore, radiance-based representations often absorb these geometric inconsistencies into their view-dependent nature, failing to enforce grounded geometric consistency. To address these issues, we propose Geometric 4D Stitching, an efficient framework that explicitly identifies missing geometric regions and complements them with geometrically grounded 4D stitches. As a result, our method constructs 4D scene representations in under 10 minutes on a single NVIDIA RTX 5090 GPU per one-step scene expansion, while improving geometric consistency. Moreover, we demonstrate that our explicit 4D stitching supports iterative expansion of 4D meshes as well as 4D scene editing.

2605.09982 2026-05-12 cs.CV

ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

Yuna Lee, Kyoungho Min, Yulhwa Kim

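AI summary This paper proposes ERASE, a two-stage vision token pruning framework that addresses the computational burden of the large number of vision tokens produced when vision-language models process high-resolution images. Using pruning strategies that adapt to the complexity of the input image, the method identifies and retains salient vision tokens, substantially reducing token count while preserving model performance. On Qwen2.5-VL-7B, ERASE retains 89.46% of the original accuracy at an 85% pruning ratio, outperforming the best prior method.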

Comments 20 pages, 8 figures

Abstract

Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.

2605.09977 2026-05-12 cs.CV

INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI

Xiaotian Hu, Mingxuan Liu, Hongjia Yang, Juncheng Zhu, Yijin Li, Yifei Chen, Haoxiang Li, Tongxi Song, Zihan Li, Yingqi Hao, Ziyu Li, Yujin Zhang, Gang Ning, Yi Liao, Haibo Qu, Qiyuan Tian

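AI summary This study proposes INFANiTE, an implicit neural representation framework that efficiently learns a high-resolution spatio-temporal fetal brain atlas from clinical thick-slice MRI scans, removing the time-consuming slice-to-volume reconstruction and iterative registration steps of conventional pipelines. The method substantially accelerates atlas construction, and experiments show that it maintains high accuracy and biological plausibility even under sparse-data conditions, offering a practical route to large-scale analysis of fetal brain development.

Abstract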

Spatio-temporal fetal brain atlases are important for characterizing normative neurodevelopment and identifying congenital anomalies. However, existing atlas construction pipelines necessitate days for slice-to-volume reconstruction (SVR) to generate high-resolution 3D brain volumes and several additional days for iterative volume registration, thereby rendering atlas construction from large-scale cohorts prohibitively impractical. We address these limitations with INFANiTE, an Implicit Neural Representation (INR) framework for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI scans, bypassing both the costly SVR and the iterative non-rigid registration steps entirely, thereby substantially accelerating atlas construction. Extensive experiments demonstrate that INFANiTE outperforms existing baselines in subject consistency, reference fidelity, intrinsic quality and biological plausibility, even under challenging sparse-data settings. Additionally, INFANiTE reduces the end-to-end processing time (i.e., from raw scans to the final atlas) from days to hours compared to the traditional 3D volume-based pipeline (e.g., SyGN), facilitating large-scale population-level fetal brain analysis. Our code is publicly available at: https://anonymous.4open.science/r/INFANiTE-5D74

2605.09976 2026-05-12 cs.CV

OZ-TAL: Online Zero-Shot Temporal Action Localization

Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui

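AI summary This paper introduces a new task, Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen action categories and their timing while a video stream is being processed. To address the limited generalization of existing methods on videos from other domains, the authors design a training-free framework that uses off-the-shelf vision-language models and adds mechanisms to enhance visual representations and reduce their biases. Experiments on THUMOS14 and ActivityNet-1.3 show that the method substantially outperforms existing state-of-the-art approaches, and the work establishes new benchmarks and baselines.

Abstract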

Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

2605.09975 2026-05-12 cs.LG math.OC

Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs

Hoyeol Yoon, Seoungbin Bae, Nam Ho-Nguyen, Dabeen Lee

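AI summary This paper studies direction selection for the multi-objective optimization that arises in training physics-informed neural networks (PINNs) and proposes a Chebyshev-center-based method for choosing the update direction. By formulating direction selection as a Chebyshev-center problem in the dual cone, the method can be solved efficiently in a low-dimensional space and comes with a convergence guarantee in the nonconvex setting. It unifies the key properties targeted by existing methods, provides an interpretable geometric criterion, and performs strongly on several benchmarks.

Abstract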

Physics-informed neural networks (PINNs) are a promising approach for solving partial differential equations (PDEs). Their training, however, is often difficult because multiple loss terms induced by PDE residuals and boundary or initial conditions must be optimized simultaneously. To address this difficulty, existing approaches often construct update directions by explicitly enforcing particular desirable properties, such as scale robustness and simultaneous descent. While effective in many cases, such property-by-property designs can make it unclear which conditions are essential, what geometric principle determines the selected update direction, and how different methods are structurally related. In this work, we formulate update-direction selection for PINN training as a Chebyshev-center problem in the dual cone. The proposed formulation selects a normalized direction that maximizes the minimum distance to the cone facets. The resulting formulation admits an efficient dual problem in a much lower-dimensional space and yields a convergence guarantee in the nonconvex setting. It also recovers the key desirable properties targeted by existing approaches without imposing them separately; rather, they follow from the single geometric criterion underlying the formulation. This makes the selected direction interpretable through a single geometric rule and provides a unified basis for systematically comparing related direction-selection methods. Experiments on several PINN benchmarks further demonstrate strong empirical performance of the proposed method.
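
The geometric rule in this abstract can be worked through for two loss gradients. The sketch rests on one reading of the abstract, which is an assumption: the cone is $\{d : g_i^\top d \ge 0\}$, the distance from a unit vector $d$ to facet $i$ is $g_i^\top d / \lVert g_i \rVert$, and the selected direction maximizes the smallest of these distances. For two gradients that are not exactly opposite, the maximizer is the bisector of the normalized gradients. The paper's exact formulation, normalization, and dual problem may differ, and all names below are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (v must be nonzero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_objective_direction(g1, g2):
    """Unit direction maximizing min(cosine to g1, cosine to g2).
    Assumes g1 and g2 are nonzero and not exactly opposite."""
    u1, u2 = normalize(g1), normalize(g2)
    return normalize([a + b for a, b in zip(u1, u2)])

G1 = [10.0, 0.0]   # gradient of a loss term with a large scale
G2 = [0.0, 0.1]    # gradient of a loss term with a small scale
D = two_objective_direction(G1, G2)
MARGINS = [dot(D, normalize(G1)), dot(D, normalize(G2))]
```

Both entries of `MARGINS` are equal and positive even though `G1` is 100 times larger than `G2`, so a step along `-D` decreases both losses to first order without the larger term dominating. This is how scale robustness and simultaneous descent can follow from one criterion in the two-gradient case.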

2605.09973 2026-05-12 cs.CL cs.AI

GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

Urchade Zaratiana, Ash Lewis, George Hurn-Maloney

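AI summary This paper presents GLiNER2-PII, a small multilingual model for extracting personally identifiable information (PII) that recognizes 42 PII entity types. To address the scarcity of training data and the associated privacy risks, the authors build a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples. On the SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among the compared systems, which include OpenAI Privacy Filter.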

Comments Under submission

Abstract

Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.
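
Span-level F1, the metric reported in this abstract, can be computed as in the sketch below. It assumes exact matching of `(start, end, label)` character spans. Whether the SPY evaluation uses exact or partial matching is not stated here, and the labels and offsets in the example are hypothetical.

```python
def span_f1(gold, predicted):
    """Exact-match span-level precision, recall and F1.
    Each span is a (start, end, label) tuple over character offsets."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1

# Hypothetical annotations for one document.
GOLD = [(0, 8, "person"), (21, 38, "email"), (50, 62, "phone")]
PRED = [(0, 8, "person"), (21, 38, "email"),
        (70, 75, "phone"), (80, 84, "date")]
PRECISION, RECALL, F1 = span_f1(GOLD, PRED)
```

Two of the four predicted spans match gold spans exactly, so precision is 2/4, recall is 2/3, and F1 is 4/7. A prediction with the right label but shifted offsets counts as both a false positive and a false negative under this matching rule.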

2605.09972 2026-05-12 cs.RO cs.CV

HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving

Zhongyu Xia, Guanyu Zhu, Guo Tang, Wenhao Chen, Yongtao Wang

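AI summary HiDrive is a new closed-loop autonomous driving benchmark designed to address the limited scenario diversity, object variety, and driving-capability coverage of existing benchmarks. It emphasizes long-tail scenarios, introduces a range of rare objects and complex traffic situations, and extends evaluation to advanced driving capabilities such as rule compliance, moral reasoning, and emergency decision-making. Built on a more advanced physics engine, HiDrive provides realistic lighting and high-fidelity visual rendering, offering a more challenging testbed for how autonomous driving systems perform in complex real-world environments.

English abstract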

End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at https://github.com/VDIGPKU/HiDrive.

2605.09969 2026-05-12 cs.LG cs.CL

The Truth Lies Somewhere in the Middle (of the Generated Tokens)

Sophie L. Wang, Phillip Isola, Brian Cheung

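AI summary This paper studies how to collapse autoregressively generated hidden states into a representation that reflects a language model's internal state. The authors find that mean pooling over the generated hidden states yields more semantically informative representations than any single token, and they validate this with kernel alignment in the language, vision, and protein domains. The study also shows that representations from generated tokens outperform those from prompt tokens, and it reveals interpretable dynamics in model behavior.

English abstract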

How should hidden states generated autoregressively be collapsed into a representation that reflects a language model's internal state? Despite tokens being generated under causal masking, we find that mean pooling across their hidden states yields more semantic representations than any individual token alone. We quantify this through kernel alignment to reference spaces in language, vision, and protein domains. The improvement through mean pooling is consistent with information being distributed across generated tokens rather than localized to a single position. Furthermore, representations derived from generated tokens outperform those from prompt tokens, and alignment across generation reveals interpretable dynamics in model behavior.
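The pooling and alignment steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and linear centered kernel alignment (CKA) is assumed as the alignment score because the abstract does not specify which kernel alignment variant is used.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(hidden_states: Matrix) -> Vector:
    """Average per-token hidden states (one row per generated token)."""
    if not hidden_states:
        raise ValueError("need at least one token")
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(dim)]


def gram(X: Matrix) -> Matrix:
    """Linear-kernel Gram matrix: K[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center(K: Matrix) -> Matrix:
    """Double-center a square Gram matrix (subtract row/column means)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def linear_cka(X: Matrix, Y: Matrix) -> float:
    """Linear CKA between two sets of representations of the same samples.

    Assumes neither set is constant across samples (nonzero centered norm).
    """
    Kc, Lc = center(gram(X)), center(gram(Y))
    dot = sum(k * l for kr, lr in zip(Kc, Lc) for k, l in zip(kr, lr))
    norm_k = math.sqrt(sum(k * k for r in Kc for k in r))
    norm_l = math.sqrt(sum(l * l for r in Lc for l in r))
    return dot / (norm_k * norm_l)
```

In use, each sample's generated-token hidden states would be pooled with `mean_pool`, and the resulting matrix (one pooled vector per sample) compared against a reference embedding of the same samples with `linear_cka`.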

2605.09967 2026-05-12 cs.LG

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

Andrew Lee, Fernanda Viégas, Martin Wattenberg

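AI summary This study examines how concepts represented as linear directions in language models relate to relational structure. Working with a model trained in a highly structured domain (the board game Othello), it finds that although the model's internal board state is linearly decodable, the representation also contains structure in the form of tensor product representations (TPRs). By training TPR probes, the study recovers structure shared among the linear probes and shows that the linear directions can be recovered directly from the TPR probe's parameters. The findings suggest that directional representations may be projections of more structured underlying representations.

English abstract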

While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain -- the board game Othello. While the model's internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model's board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board, but perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.

2605.09963 2026-05-12 cs.CV

Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang

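AI summary Existing self-supervised learning methods mainly learn object-invariant representations and often neglect the spatial structure and relationships among object parts. To address this, the paper proposes Spatial Prediction (SP), a spatially aware pretext task that learns fine-grained spatial dependencies by predicting the relative position and scale between two disentangled local views of the same image. Experiments show consistent improvements on image recognition, fine-grained classification, semantic segmentation, and depth estimation, along with better robustness in out-of-distribution settings.

English abstract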

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.

2605.09959 2026-05-12 cs.LG cs.AI cs.CL cs.ET

G-Zero: Self-Play for Open-Ended Generation from Zero Data

Chengsong Huang, Haolin Liu, Tong Zheng, Runpeng Dai, Langlin Huang, Jinyuan Li, Zongxia Li, Zhepei Wei, Yu Meng, Jiaxin Huang

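AI summary This paper proposes G-Zero, a self-evolving framework that lets large language models improve autonomously and continuously without external evaluation, targeting open-ended generation tasks. Its core is Hint-δ, an intrinsic reward that measures the shift in the Generator's own predictions with and without a self-generated hint, combined with co-evolutionary training of a Proposer model and a Generator model. Because the method does not rely on an external judge, it avoids the reward hacking and capability bottlenecks such judges introduce, offering a scalable and robust route to self-evolution in unverifiable domains.

English abstract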

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$δ$, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.

2605.09956 2026-05-12 cs.CV cs.AI

SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis

Peng Jia, Zhen Xiao, Jia Li, Xueliang Liu, Zhenzhen Hu, Lingyun Yu

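AI summary This paper proposes SDTalk, a one-shot 3D Gaussian Splatting (3DGS) framework for high-quality, real-time talking head synthesis that generalizes to unseen identities without personalized training. It introduces structured facial priors to improve the completeness of head reconstruction and a dual-branch motion field to improve the detail of facial dynamics, and it outperforms existing methods in visual quality and inference efficiency.

Comments 5 pages, 4 figures, 4 tables

English abstract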

High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.

2605.09955 2026-05-12 cs.CL

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks

Tadesse Destaw Belay, Ibrahim Said Ahmad, Idris Abdulmumin, Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Olga Kolesnikova, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam

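AI summary This paper studies how to model annotator disagreement in subjective NLP tasks. The authors propose an agreement-based clustering method that captures differences in annotator perspectives. Experiments on 40 datasets in 18 languages show that, compared with majority voting and individual annotator modeling, the method makes fuller use of annotator perspectives and significantly improves classification performance. The study also compares several aggregation strategies and finds that the multi-label and multitask approaches work better for clustered annotators.

Comments Pre-MIT Press publication version

English abstract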

Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments on 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better suited to modeling clustered annotators than the ensemble and majority-vote approaches.

2605.09954 2026-05-12 cs.RO cs.CV

JODA: Composable Joint Dynamics for Articulated Objects

Tianhong Gao, Cheng Yu, Yinghao Xu, Mengyu Chu

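AI summary This paper proposes JODA, a composable framework for generating joint-level dynamics that captures fine-grained mechanical behaviors such as frictional holding, snap latching, and soft closing. JODA describes conservative forces, dry friction, and damping as a structured three-channel field over the joint degree of freedom and instantiates it with shape-constrained piecewise cubic interpolation, giving a dynamics model that is expressive and compatible with differentiable simulation. The method supports inferring and refining joint dynamics from multimodal inputs and provides a unified interface for inference, editing, and optimization.

English abstract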

Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.

2605.09951 2026-05-12 cs.LG

Generating synthetic electronic health record data using agent-based models to evaluate machine learning robustness under mass casualty incidents

Roben Delos Reyes, Daniel Capurro, Nicholas Geard

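AI summary This study proposes an agent-based modelling approach for generating synthetic electronic health record (EHR) data to evaluate the robustness of machine learning models under extreme conditions such as mass casualty incidents (MCIs). It uses real-world EHR data to build an agent-based model of an emergency department that simulates patient arrivals, resource capacity, and clinical workflow, and it generates synthetic data reflecting MCI scenarios by changing these system conditions. Experiments show consistent declines in recall under MCI conditions, highlighting the impact of system changes on model performance and patient outcomes.

Comments 14 pages, 1 figure; accepted at CHIL 2026

English abstract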

ML models in healthcare are typically evaluated using curated real-world EHR data. A key limitation of such evaluations is that they may fail to assess the robustness of ML models to changes in the data at deployment, which is a common issue because EHR data used for ML model development cannot capture all such changes. Mass casualty incidents (MCIs) caused by disasters are critical instances where this will be an issue, as they induce rare, uncertain, and novel changes to routine system conditions. Because real-world EHR data from MCIs are often limited or unavailable, assessing ML robustness under such conditions before deployment remains challenging. Here, we propose an agent-based modelling approach for generating synthetic EHR data to evaluate the robustness of ML models under MCI scenarios. We use real-world EHR data to develop and calibrate an agent-based model (ABM) of an emergency department (ED) that explicitly models patient arrivals, resource capacity, and clinical workflow. By changing these system conditions to reflect plausible MCI scenarios, the ED model generates synthetic versions of the real-world EHR data that exhibit shifts in system behaviour. Using these synthetic data, we test ML models for predicting length of stay. We observed consistent declines in recall under MCI conditions relative to baseline system conditions, resulting in an increase in the number of patients with prolonged length of stay that were missed by the ML models. These results highlight the impact of changes in system conditions on patient outcomes, EHR data, and ML model performance. Our work establishes ABM-based synthetic EHR data generation as a proactive and systematic approach for evaluating the robustness of ML models under MCI or other system conditions not captured in real-world EHR data, supporting the safer and more effective deployment of ML models in healthcare systems.

2605.09950 2026-05-12 cs.LG cs.AI

Novel GPU Boruta algorithms for feature selection from high-dimensional data

Xurui Li, Zhiguo Gan, Jiaming Zhang, Zheng Liu, Diannan Lu

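AI summary To address the inefficiency of conventional feature selection algorithms on CPUs for high-dimensional data, this paper proposes two GPU-accelerated Boruta feature selection methods, Boruta-Permut and Boruta-TreeImp, based on permutation feature importance and impurity-reduction importance, respectively. Experiments show that both methods greatly improve computational efficiency while keeping selection accuracy comparable to the original Boruta algorithm, offering an efficient and cost-effective option for feature selection on large-scale data.

Comments This paper has been submitted to the journal Data Mining and Knowledge Discovery, and a preprint is available for the authors' records

English abstract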

Most feature selection algorithms, especially wrapper methods, run inefficiently on CPU-based platforms because of their high computational complexity. This inefficiency makes them unsuitable for processing large-scale datasets. To address this challenge, the present study proposes two GPU-accelerated versions of the Boruta feature selection procedure: Boruta-Permut, which relies on permutation-based feature importance, and Boruta-TreeImp, which uses importance based on impurity reduction. To evaluate these methods, we conducted experiments on both a self-constructed dataset and several publicly available datasets. The experimental results show that the proposed GPU-accelerated algorithms greatly improve computational efficiency while preserving feature selection accuracy comparable to the original Boruta algorithm. In our analysis, we also observe that the impurity-reduction-based version can overestimate the importance of some features. Overall, these findings suggest that performing Boruta feature selection on GPUs offers an effective and cost-efficient solution for large-scale data analysis.
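For readers unfamiliar with Boruta, the shadow-feature comparison at its core can be sketched on the CPU in plain Python. This is a simplified illustration and not the paper's GPU algorithms: absolute Pearson correlation stands in for the random-forest importance that Boruta normally uses, the statistical test on hit counts is omitted, and the names `abs_corr` and `boruta_hits` are hypothetical.

```python
import math
import random
from typing import List


def abs_corr(x: List[float], y: List[float]) -> float:
    """Absolute Pearson correlation, used here as a stand-in importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no signal
    return abs(sxy) / math.sqrt(sxx * syy)


def boruta_hits(columns: List[List[float]], y: List[float],
                n_rounds: int = 20, seed: int = 0) -> List[int]:
    """Count how often each feature beats the best shadow feature.

    Each round shuffles a copy of every column to build "shadow" features
    whose relation to y is destroyed, then awards a hit to every real
    feature whose importance exceeds the maximum shadow importance.
    """
    rng = random.Random(seed)
    hits = [0] * len(columns)
    for _ in range(n_rounds):
        shadows = []
        for col in columns:
            shadow = col[:]
            rng.shuffle(shadow)
            shadows.append(shadow)
        shadow_max = max(abs_corr(s, y) for s in shadows)
        for j, col in enumerate(columns):
            if abs_corr(col, y) > shadow_max:
                hits[j] += 1
    return hits
```

A feature that collects hits in nearly every round is a candidate to keep, while one that rarely beats the shadows is a candidate to reject. The full procedure makes that decision with a statistical test over the hit counts.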

2605.09949 2026-05-12 cs.LG

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

Zehao Li, Yasuhiro Yoshikai, Shumpei Nemoto, Hiroyuki Kusuhara, Tadahaya Mizuno

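AI summary This study investigates how chemical language models learn chemical meaning from molecular string representations rather than relying only on surface string patterns, focusing on chirality as a demanding test case. It presents Pan-CORE, a Transformer-based SMILES translation model, and analyzes training at high temporal resolution to examine how chiral information is learned. Chiral-token accuracy rises abruptly after a long plateau, suggesting that the difficulty of learning chirality is not explained by model capacity alone and also reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry point to a central role for the encoder.

English abstract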

Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, seen as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.

2605.09948 2026-05-12 cs.AI cs.CV cs.RO

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie, Qiang Li, Zhiwei Wang

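AI summary Current vision-language-action (VLA) models usually treat the deepest representation of the vision-language backbone as the best input for action prediction, but robotic manipulation involves frequent closed-loop spatial adjustments, where excessive abstraction may waste computation and weaken the low-level geometric cues needed for precise control. The paper proposes LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation: a shared Transformer block iteratively refines multimodal features, and each iteration produces a candidate action and a sufficiency score that determines whether further refinement is needed. Experiments show that LoopVLA matches or outperforms strong baselines in task success while reducing parameters by 45% and improving inference throughput by up to 1.7 times.

English abstract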

Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
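The control flow described above, in which one shared block is applied repeatedly and refinement stops once a sufficiency score is high enough, can be sketched generically. This is a toy illustration under assumptions, not the LoopVLA implementation: the threshold-based stopping rule, the function `refine_until_sufficient`, and the scalar toy functions are all hypothetical stand-ins for the learned Transformer block, action head, and sufficiency head.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")  # representation type
A = TypeVar("A")  # action type


def refine_until_sufficient(
    state: S,
    refine: Callable[[S], S],
    act: Callable[[S], A],
    sufficiency: Callable[[S], float],
    threshold: float = 0.9,
    max_iters: int = 8,
) -> Tuple[A, int]:
    """Apply the same refinement step until the representation is judged
    sufficient or the iteration budget runs out.

    Returns the last candidate action and the number of iterations used.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    action = None
    steps = 0
    for steps in range(1, max_iters + 1):
        state = refine(state)       # shared parameters at every iteration
        action = act(state)         # candidate action at this depth
        if sufficiency(state) >= threshold:
            break                   # further refinement judged unnecessary
    return action, steps


# Toy stand-ins: the "representation" is a scalar that moves toward 1.0.
def toy_refine(s: float) -> float:
    return s + 0.5 * (1.0 - s)


def toy_act(s: float) -> float:
    return 2.0 * s


def toy_sufficiency(s: float) -> float:
    return 1.0 - abs(1.0 - s)
```

Because the same `refine` callable is reused at every step, the exit decision depends on the evolving representation rather than on a fixed layer index, which is the property the abstract attributes to parameter sharing.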