2605.16842 2026-05-19 cs.AI

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu

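AI summary: This paper proposes HT-GRPO, a hierarchical reinforcement learning method that uses a Sketch-Then-Paint training scheme and a Hierarchical Credit Assignment mechanism to address key obstacles in RL optimization of diffusion multi-modal large language models, improving image quality and aesthetics.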

Abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
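The idea of weighting credit by generation stage can be illustrated with a minimal sketch. It is not the authors' implementation: the stage weights, the split of unmasking steps into thirds, and all function names are assumptions made for illustration. Only the group-relative reward normalization follows the usual GRPO recipe.

```python
from statistics import mean, pstdev

# Hypothetical stage weights: earlier, layout-defining tokens receive more credit.
STAGE_WEIGHTS = {"global": 1.5, "structure": 1.0, "refinement": 0.5}


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stage_of(step, total_steps):
    """Map an unmasking step to a stage by thirds (an illustrative split)."""
    frac = step / total_steps
    if frac < 1 / 3:
        return "global"
    if frac < 2 / 3:
        return "structure"
    return "refinement"


def token_advantages(sample_advantage, unmask_steps, total_steps):
    """Scale one sample's advantage per token by the stage in which it was unmasked."""
    return [
        sample_advantage * STAGE_WEIGHTS[stage_of(s, total_steps)]
        for s in unmask_steps
    ]


rewards = [0.2, 0.8, 0.5, 0.5]          # one reward per sampled image
advs = group_relative_advantages(rewards)
# Tokens of the best sample, unmasked at steps 0, 4, and 8 out of 9.
per_token = token_advantages(advs[1], unmask_steps=[0, 4, 8], total_steps=9)
```

With uniform credit, every token of a sample would share the same advantage; here the token unmasked first carries three times the weight of the one unmasked last.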

2605.16839 2026-05-19 cs.CL

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Jiwon Song, Dongwon Jo, Beomseok Kang, Jae-Joon Kim

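AI summary: This paper proposes CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. It treats 2D block-sparse masks as KV-selection signals to compute attention efficiently, delivering up to 2.72x attention speedup while keeping accuracy close to dense attention.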

Abstract

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72$\times$ attention speedup at 128K context length under chunked prefill.
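The two unions named in the abstract can be sketched in a few lines. This is an illustration only, under the assumption that the mask is indexed as `mask[head][q_block][kv_block]` and that a GQA group consists of `group_size` consecutive query heads sharing one KV head; the function name is hypothetical.

```python
def block_union_tables(mask, group_size):
    """Build one KV block table per GQA group from a 2D block-sparse mask.

    mask[h][q][k] is truthy when query head h, Q block q selects KV block k.
    """
    num_heads = len(mask)
    assert num_heads % group_size == 0, "heads must divide evenly into groups"
    tables = []
    for g in range(num_heads // group_size):
        selected = set()
        # Intra-group union: merge all query heads that share this KV head.
        for h in range(g * group_size, (g + 1) * group_size):
            # Q-block union: merge every Q block of the current chunk.
            for q_row in mask[h]:
                selected.update(k for k, on in enumerate(q_row) if on)
        tables.append(sorted(selected))
    return tables


mask = [
    [[1, 0, 0, 0], [1, 1, 0, 0]],  # head 0
    [[0, 0, 0, 1], [1, 0, 0, 0]],  # head 1
    [[0, 0, 1, 0], [0, 0, 1, 0]],  # head 2
    [[0, 0, 0, 0], [0, 0, 1, 1]],  # head 3
]
tables = block_union_tables(mask, group_size=2)
```

Each table keeps every KV block that any head or Q block in the group selected, so no selected block is lost, and a paged attention kernel could read the listed blocks in place.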

2605.16834 2026-05-19 cs.CV cs.AI cs.LG

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

Shiwon Kim, Yu Rang Park

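AI summary: This paper proposes a relative-representation method for fine-grained multimodal alignment with limited data; by learning token-level cross-modal structure, it improves zero-shot classification, cross-modal retrieval, and zero-shot segmentation.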

Abstract

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
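A relative representation replaces each token embedding with its similarities to a set of anchors, which makes embeddings from spaces of different dimension comparable. The sketch below assumes cosine similarity and uses fixed toy anchors; in the paper the anchors are learned, and the names here are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(tokens, anchors):
    """Describe each token by its similarity to every anchor of its own modality."""
    return [[cosine(t, a) for a in anchors] for t in tokens]


# Image tokens live in 3-D, text tokens in 2-D; each modality has two anchors.
image_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_anchors = [[1.0, 0.0], [0.0, 1.0]]

image_rel = relative_representation([[2.0, 0.0, 0.0]], image_anchors)
text_rel = relative_representation([[3.0, 0.0]], text_anchors)

# Both results have length len(anchors), so they can be compared directly.
match = cosine(image_rel[0], text_rel[0])
```

Training the anchors so that matched image-text pairs produce similar patterns is the part this sketch leaves out.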

2605.16832 2026-05-19 cs.CV

Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction

Shuliang Zhu, Tomiwa Adey, Jinjia Zhou

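AI summary: This paper proposes an interface-preserving semantic augmentation for LLM-conditioned structured decoding. It associates semantic evidence with the point-cloud representation and encodes it as an RGBB point interface to improve structured indoor prediction, especially opening localization and furniture detection in cluttered scenes.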

Abstract

Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
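The RGBB code is simple enough to write down directly. The sketch below assumes points are plain attribute tuples and that each point already carries one of the four group labels; the function name and the 0/1 color scale are assumptions.

```python
# Four semantic states in three RGB channels; "others" is black, not a fourth channel.
RGBB = {
    "furniture": (1, 0, 0),  # red
    "walls": (0, 1, 0),      # green
    "openings": (0, 0, 1),   # blue
    "others": (0, 0, 0),     # black
}


def append_semantic_color(points, groups):
    """Append the semantic color to each point's raw attributes before tokenization."""
    return [tuple(p) + RGBB[g] for p, g in zip(points, groups)]


points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5)]   # x, y, z per point
groups = ["walls", "openings"]
colored = append_semantic_color(points, groups)
```

Because the color is just three extra attributes on each point, the tokenizer, the language model decoder, and the output format need no changes.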

2605.16829 2026-05-19 cs.CL cs.PL

AgentKernelArena: A Generalization-Aware Benchmark for GPU Kernel Optimization Agents

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

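AI summary: This paper presents AgentKernelArena, an open-source benchmark for evaluating GPU kernel optimization agents. Using isolated workspaces and centralized scoring, it tests agent performance and generalization across tasks and hardware targets, finding near-perfect compilation and high correctness rates on most task categories but substantial correctness drops on unseen configurations for PyTorch-to-HIP translation.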

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
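Gated evaluation means a kernel earns a performance score only after it compiles and passes correctness checks. The sketch below shows one way such a gate could look. The `RunResult` record and the rule of scoring a failed gate as 0.0 are assumptions for illustration, not the benchmark's actual scoring formula.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of running one agent-produced kernel (hypothetical record)."""
    compiled: bool
    correct: bool
    baseline_ms: float
    optimized_ms: float


def gated_speedup(result):
    """Return the speedup only when both earlier gates pass, otherwise 0.0."""
    if not result.compiled:
        return 0.0
    if not result.correct:
        return 0.0
    return result.baseline_ms / result.optimized_ms


def mean_speedup(results):
    """Average gated speedup over a set of tasks or input configurations."""
    scores = [gated_speedup(r) for r in results]
    return sum(scores) / len(scores)


seen = [RunResult(True, True, 10.0, 2.0), RunResult(True, True, 8.0, 4.0)]
# Same kernels on shapes the agent never saw: one now produces wrong output.
unseen = [RunResult(True, True, 10.0, 2.5), RunResult(True, False, 8.0, 4.0)]
```

Comparing `mean_speedup(seen)` with `mean_speedup(unseen)` exposes kernels that hardcode shape-specific assumptions, which is the failure mode the abstract reports for PyTorch-to-HIP tasks.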

2605.16818 2026-05-19 cs.CV cs.AI

Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions

Chiyuan Ma, Zihan Zhou, Tianshu Yu

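AI summary: This paper proposes Observation-Aligned Mask Priors, which learns the distribution of authentic observation masks and uses it to build context-query partitions for learning physical dynamics from incomplete data. It pretrains a Bayesian Flow Network on binary masks and guides sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation, preventing zero-query dead zones and local generative collapse.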

Abstract

Learning physical dynamics directly from incomplete observations is challenging because authentic occlusions are structured, sample-dependent, and often missing not at random, whereas existing methods typically rely on heuristic masking rules or predefined mask distributions. We propose Observation-Aligned Mask Priors, a framework that learns the distribution of authentic observation masks and uses it to construct context-query partitions for training from incomplete data. Specifically, we pretrain a Bayesian Flow Network (BFN) on binary observation masks to capture real occlusion topologies, then guide BFN sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation. The intersection between the guided mask and the observed mask defines the context, and the remaining observed entries become query targets for a diffusion-based reconstruction model. We show that this intersection-based partitioning gives every valid observed dimension a strictly positive probability of being queried, preventing zero-query dead zones and local generative collapse. Experiments on three real-world oceanographic datasets with authentic satellite occlusions, across resolutions up to 256$\times$256, show consistent improvements over strong diffusion baselines in MSE and PSNR. These results demonstrate that learning mask priors from authentic occlusions is an effective alternative to heuristic masking for learning from incomplete physical observations without access to fully observed fields.
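The intersection-based partition can be stated as two boolean operations. The sketch below takes the guided mask as given (sampling it from the BFN is out of scope here) and represents masks as flat lists of 0/1 values; the function name is illustrative.

```python
def partition(observed, guided):
    """Split observed entries into context and query sets.

    context = observed AND guided
    query   = observed AND NOT guided
    Unobserved entries belong to neither set.
    """
    context = [bool(o) and bool(g) for o, g in zip(observed, guided)]
    query = [bool(o) and not bool(g) for o, g in zip(observed, guided)]
    return context, query


observed = [1, 1, 0, 1]   # 1 = the sensor actually measured this entry
guided = [1, 0, 1, 0]     # 1 = kept visible by the sampled mask
context, query = partition(observed, guided)
```

Every observed entry lands in exactly one of the two sets, and no unobserved entry is ever used as a training target, so the model never needs fully observed fields.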

2605.16810 2026-05-19 cs.CV

Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending

Jingqi Hou, Hongtian Wang

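AI summary: This paper proposes a training-free framework for occluded text rendering built on a pretrained FLUX.1-dev model. It addresses occluder placement and text-structure stability with dual-stream inference and a glyph prior, improving text readability with competitive occlusion alignment.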

Comments 9 pages, 3 figures, 3 tables

Abstract

We present a training-free framework for occluded text rendering with a pretrained FLUX.1-dev backbone. The task requires a model to render recognizable typography and place an occluding object over the intended text region. This setting remains difficult for existing text-to-image generators: the occluder often drifts away from the text, while the text may be distorted or appear to float on top of the occluding object. To address this problem, we propose a restarted dual-stream inference framework that decouples text-layout preservation from occluder insertion. A Base Stream provides a clean typographic reference and same-step key/value (K/V) features, while the Edit Stream is conditioned on the occlusion prompt. We further adopt the spectral glyph-prior idea from FreeText and adapt it to stabilize the target text structure during early-to-mid denoising. In the reasoning pass, our method localizes the target text, estimates a text-band region from token-conditioned attention and glyph support, and derives an anchor-aware hard fusion mask for the occluder. In the final edit pass, generation restarts from the same initial noise and applies hard mask-guided image-token K/V replacement at selected attention sites, preserving the Base layout outside the mask while injecting the occluder appearance from the Edit Stream inside the mask. Experiments on representative occluded text scenarios demonstrate substantially improved text readability and competitive occlusion alignment, yielding more stable object-on-text compositions without any model fine-tuning.
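Hard mask-guided K/V replacement amounts to a per-token selection between the two streams. The sketch below uses strings as stand-ins for key/value features and a flat list of image tokens; the function name and data layout are assumptions, and a real implementation would operate on attention tensors inside the model.

```python
def replace_kv(base_kv, edit_kv, mask):
    """Pick K/V features per image token.

    Inside the mask (truthy) the Edit Stream features are used, so the occluder
    appears there; outside the mask the Base Stream features keep the text layout.
    """
    return [e if m else b for b, e, m in zip(base_kv, edit_kv, mask)]


base_kv = ["b0", "b1", "b2", "b3"]      # same-step features from the Base Stream
edit_kv = ["e0", "e1", "e2", "e3"]      # features from the Edit Stream
mask = [False, True, True, False]       # hard fusion mask over image tokens
fused = replace_kv(base_kv, edit_kv, mask)
```

Because the mask is hard rather than soft, each token takes its features from exactly one stream and nothing is blended at the boundary.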

2605.16809 2026-05-19 cs.LG

Informative Graph Structure Learning

Shen Han, Zhiyao Zhou, Jiawei Chen, Sheng Zhou, Canghong Jin, Hai Lin, Da Zhong Li, Bingde Hu, Can Wang

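AI summary: This paper proposes Informative Graph Structure Learning (InGSL), which considers both similarity and diversity when constructing edges, reducing edge count while improving performance.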

Abstract

The quality of graph-structured data is fundamental to the success of modern graph analysis techniques such as Graph Neural Networks (GNNs). However, real-world graph data is often suboptimal, suffering from issues such as noise and incomplete connections. Graph Structure Learning (GSL) has emerged as a promising technique that adaptively optimizes node connections. However, we observe that the effectiveness of GSL often comes at the cost of a dramatic expansion in edge count, resulting in significant storage and computational overhead. In this work, we reveal that this limitation stems from the prevalent use of similarity-based edge construction, which predominantly connects highly similar neighbors based on their embeddings, introducing substantial structure redundancy. To address this, we propose a novel Informative Graph Structure Learning method (InGSL), which jointly considers both similarity and diversity in edge construction by incorporating a mutual-information-guided learning strategy. Notably, InGSL serves as a plug-in module that can be seamlessly integrated into existing GSL frameworks. Through extensive experiments on six representative GSL methods, we demonstrate that InGSL achieves significant performance improvements at a reduced number of edges.
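To see why pure similarity produces redundant edges, compare it with a selection rule that also penalizes overlap among the chosen neighbors. The sketch below uses a greedy maximal-marginal-relevance rule as a stand-in. It is not InGSL's mutual-information-guided strategy, and the function name and the `lam` trade-off parameter are assumptions.

```python
def select_neighbors(sim, node, k, lam=0.5):
    """Greedily pick k neighbors that are similar to `node` but not to each other.

    sim is a symmetric similarity matrix; lam = 0 reduces to plain top-k similarity.
    """
    candidates = [j for j in range(len(sim)) if j != node]
    chosen = []
    while candidates and len(chosen) < k:
        def score(j):
            # Redundancy: how close candidate j is to anything already chosen.
            redundancy = max((sim[j][c] for c in chosen), default=0.0)
            return sim[node][j] - lam * redundancy

        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


# Nodes 1 and 2 are near-duplicates of each other; node 3 is different.
sim = [
    [1.0, 0.9, 0.85, 0.6],
    [0.9, 1.0, 0.95, 0.1],
    [0.85, 0.95, 1.0, 0.1],
    [0.6, 0.1, 0.1, 1.0],
]
similar_only = select_neighbors(sim, node=0, k=2, lam=0.0)
diverse = select_neighbors(sim, node=0, k=2, lam=1.0)
```

With `lam=0.0` both edges go to the near-duplicate pair; with the diversity penalty the second edge goes to node 3, which adds new information for the same edge budget.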

2605.16807 2026-05-19 cs.CV

Lever: Speculative LLM Inference on Smartphones

Tuowei Wang, Fengzu Li, Yanfan Sun, Wei Gao, Ju Ren

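AI summary: This paper presents Lever, a system that jointly optimizes the three stages of speculative decoding for efficient flash-backed LLM inference on smartphones, substantially reducing inference latency.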

Abstract

Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
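The reason speculative decoding suits flash-backed inference is that one target-model invocation can confirm several tokens. The sketch below shows plain greedy verification of a linear draft. It is not Lever's token-tree drafting or early-exit pruning, and `target_next` is a hypothetical stand-in for the flash-resident target model. A real system scores all draft positions in a single forward pass rather than calling the model per position.

```python
def verify_draft(prefix, draft, target_next):
    """Accept draft tokens while they match the target model's greedy choice.

    On the first mismatch the target's own token is emitted instead; if every
    draft token matches, the target contributes one extra token at the end.
    """
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # correction supplied by the target
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token
    return accepted


def toy_target(tokens):
    """Toy target model: the next token is the last token plus one, modulo 10."""
    return (tokens[-1] + 1) % 10


partial = verify_draft([1, 2], [3, 4, 9, 0], toy_target)
full = verify_draft([1, 2], [3, 4], toy_target)
```

In both calls the target is consulted once in spirit yet three tokens are produced, which is where the saving in flash I/O comes from.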

2605.16785 2026-05-19 cs.CV cs.AI

Encoding Robust Topological Signatures for Hyperdimensional Computing

Arpan Kusari

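AI summary: This paper proposes a topology-guided hyperdimensional computing method that extracts discrete topological primitives and pairs them with RTS-invariant shape signatures, improving robustness to rotation, noise, and occlusion; experiments on MNIST and EMNIST show gains over a naive HD baseline.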

Abstract

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives, most notably holes, from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that Topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.
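Role binding and permutation-invariant bundling are standard HD operations and can be sketched with plain lists. In the sketch below, random bipolar vectors stand in for the projected Zernike and hole descriptors, and the dimension, seed, and names are assumptions. It is independent of the code in the authors' repository.

```python
import random

DIM = 2000
rng = random.Random(0)


def random_hv():
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Role binding: elementwise product; binding with the same vector again undoes it."""
    return [x * y for x, y in zip(a, b)]


def bundle(hvs):
    """Bundling: sign of the elementwise sum (ties go to +1); input order is irrelevant."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]


def cosine(a, b):
    """Cosine similarity; for bipolar vectors this is the dot product divided by DIM."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


role_outer, role_hole = random_hv(), random_hv()
outer, hole1, hole2 = random_hv(), random_hv(), random_hv()

parts = [bind(role_outer, outer), bind(role_hole, hole1), bind(role_hole, hole2)]
image_hv = bundle(parts)

# Unbinding with the role recovers a noisy copy of the outer-shape descriptor.
recovered = bind(image_hv, role_outer)
```

The bundled vector stays similar to each of its bound parts and nearly orthogonal to unrelated vectors, which is what lets a variable number of holes be folded into one fixed-size image hypervector.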

2605.16779 2026-05-19 cs.CV cs.AI

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

Mingyang Zhao, Sipu Ruan, Xiaohong Jia

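AI summary: This paper proposes a new method for fitting superquadrics to point clouds contaminated by noise and outliers. By recasting the problem as unsupervised clustering, it fits rigid and deformable superquadrics in one unified framework, with closed-form analytical solutions and a convergence proof.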

Comments 20 pages, Code: https://github.com/zikai1/SuperquadricFitting

Journal ref: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2026

Abstract

This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.
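For readers unfamiliar with superquadrics, the shape being fitted is defined by a standard implicit inside-outside function. The sketch below evaluates that textbook definition for a superquadric in its canonical pose. It is not the clustering-based objective of the paper, and the function and parameter names are illustrative.

```python
def inside_outside(point, scale, eps):
    """Standard superquadric inside-outside function in canonical pose.

    scale = (a, b, c) are the axis lengths, eps = (e1, e2) the shape exponents.
    Returns 1 on the surface, less than 1 inside, and more than 1 outside.
    """
    x, y, z = point
    a, b, c = scale
    e1, e2 = eps
    xy_term = (abs(x / a) ** (2 / e2) + abs(y / b) ** (2 / e2)) ** (e2 / e1)
    return xy_term + abs(z / c) ** (2 / e1)


# With e1 = e2 = 1 the superquadric is an ordinary ellipsoid.
on_surface = inside_outside((1.0, 0.0, 0.0), scale=(1.0, 2.0, 3.0), eps=(1.0, 1.0))
center = inside_outside((0.0, 0.0, 0.0), scale=(1.0, 2.0, 3.0), eps=(1.0, 1.0))
outside = inside_outside((2.0, 0.0, 0.0), scale=(1.0, 2.0, 3.0), eps=(1.0, 1.0))
```

Fitting methods search for the scale, exponents, and pose that bring noisy points close to the surface; the paper does this through a clustering formulation tied to orthogonal distances.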

2605.16776 2026-05-19 cs.LG cs.AI

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

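AI summary: This paper proposes Distinguishable Deletion (D^2), which erases undesirable knowledge by restricting the response distribution in the latent representation while distinguishing it from retained knowledge, enabling a safe and coherent refusal mechanism and more effective LLM unlearning.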

Comments ICML2026 Accepted

Abstract

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
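The abstract does not define the energy index, so the sketch below uses the common energy score from energy-based models, the negative temperature-scaled log-sum-exp of the logits, as an assumed stand-in. The threshold, its direction (high energy triggers refusal), and the function names are likewise assumptions and may differ from the EUA repository.

```python
import math


def energy(logits, temperature=1.0):
    """Energy score: -T * logsumexp(logits / T), computed stably."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def should_refuse(logits, threshold):
    """Refuse when the energy exceeds the threshold (assumed decision rule)."""
    return energy(logits) > threshold


confident = [10.0, 0.0, 0.0]   # peaked distribution: low energy
flat = [0.0, 0.0, 0.0]         # flat distribution: higher energy
```

Under this assumed convention, training would push unlearned inputs toward high energy and retained inputs toward low energy, so a single threshold separates the two at inference time.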
