arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16857 2026-05-19 cs.AI

Learning to Learn from Multimodal Experience

Xingyu Sui, Weixiang Zhao, Yongxin Tang, Yanyan Zhao, Yang Wu, Dandan Tu, Bing Qin

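AI summary: This paper proposes a new learning paradigm, learning to learn from multimodal experience, which dynamically constructs and uses memory to improve agent performance and generalization, addressing the shortcomings of fixed memory designs in multimodal environments.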

Abstract

Experience-driven learning has emerged as a promising paradigm for enabling agents to improve from interaction trajectories by accumulating and reusing past experience. However, existing approaches are predominantly developed in textual settings and rely on manually designed memory schemas, limiting their applicability to multimodal environments. In real-world scenarios, experience is inherently multimodal, involving heterogeneous signals across perception, reasoning, and action, which makes effective memory design significantly more challenging. In particular, the optimal way to structure and utilize multimodal experience is highly task-dependent and evolves over time, rendering fixed memory designs insufficient. In this work, we propose a new paradigm, learning to learn from multimodal experience, which shifts memory design from a predefined component to an adaptive and learnable process. Our framework enables agents to dynamically construct, organize, and utilize memory based on task requirements and interaction history, effectively learning how to structure experience for improved performance. Experiments demonstrate that adaptive memory design substantially enhances agent performance and generalization across multimodal tasks, highlighting the critical role of learning memory mechanisms in experience-driven learning.

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

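AI summary: This paper proposes Pattern Inference and Pattern Induction, two strategies that let vision-language models perceive and reason more efficiently and accurately in visual planning tasks, addressing the perceptual bottleneck that arises with complex inputs.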

Abstract

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs) when input complexity exceeds their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks far beyond their initial capabilities, but at a cost: too many TWI operations significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy that treats visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in the FrozenLake, Crafter, and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

2605.16844 2026-05-19 cs.AI

Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence

Boris Kriuk

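AI summary: This paper examines the unnamed regime of machine behavior between narrow and general intelligence, proposes the concept of Artificial Adaptive Intelligence (AAI), defines an adaptivity index and a principle of parametric minimality, analyzes three pathways to AAI, and illustrates them with applications in several domains.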

Abstract

Between the narrow systems we deploy and the general intelligence we speculate about lies an entire regime of machine behavior that has never received its own name. This monograph argues that this regime is not empty: it is where meta-learning, neural architecture search, AutoML, continual learning, evolutionary computation, and physics-informed modeling have quietly converged on a common principle, namely the steady removal of the human from the loop of parameter specification. We name this regime Artificial Adaptive Intelligence (AAI) and define it operationally: a system exhibits AAI to the extent that it requires no human-specified tunable hyperparameters while maintaining competitive performance across a diverse distribution of tasks. To make the definition quantitative, we introduce an adaptivity index that measures progress along an axis orthogonal to scale, combining the fraction of hyperparameters absorbed by the system with the performance ratio against a task-specialized baseline. We develop the principle of parametric minimality and ground it in the minimum description length framework, showing that the appropriate hyperparameter count is data-determined rather than designer-determined. We then organize the field around three pathways to minimality: data- and task-aware configuration, structural and evolutionary morphing, and in-training self-adaptation. We analyze their stability, convergence, and governance implications, and illustrate them through case studies spanning aerospace design, financial regime detection, turbulence modeling, ecological dynamics, and vision-language systems. The thesis is that the path from ANI to AGI passes through AAI, and that naming this stage changes what we measure, what we build, and what we call a success.
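
The abstract names the two ingredients of the adaptivity index but not its formula, so the following is only an illustrative sketch. It assumes, purely for illustration, that the fraction of absorbed hyperparameters is multiplied by the performance ratio; the function name, the product form, and the cap on the ratio are all assumptions rather than the monograph's definition.

```python
def adaptivity_index(absorbed, total, system_score, baseline_score):
    """Hypothetical index: (fraction of hyperparameters absorbed) * (capped performance ratio)."""
    if total <= 0 or baseline_score <= 0:
        raise ValueError("total and baseline_score must be positive")
    absorbed_fraction = absorbed / total
    # Capping the ratio at 1.0 is an assumption made for this sketch.
    performance_ratio = min(system_score / baseline_score, 1.0)
    return absorbed_fraction * performance_ratio

# Made-up numbers: 8 of 10 hyperparameters set by the system, 90% of baseline accuracy.
example = adaptivity_index(absorbed=8, total=10, system_score=0.81, baseline_score=0.90)
```

Under this assumed form, a system scores high only if it both removes human-set hyperparameters and stays close to the specialized baseline.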

2605.16843 2026-05-19 cs.CL

RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

Joy Bose

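AI summary: This paper presents RTI-Bench, a structured dataset of decisions from India's Central Information Commission (CIC) for analyzing right-to-information decisions. Described as the first publicly released dataset of its kind, it includes outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines.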

Comments 8 pages, 4 tables

Abstract

India's Right to Information Act, 2005 gives every citizen the right to demand information from public authorities, yet in practice most people cannot make sense of the dense administrative language used in Central Information Commission (CIC) decisions, let alone predict whether an appeal is worth filing. This paper introduces RTI-Bench, a structured dataset of CIC decisions with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. To the best of our knowledge it is the first publicly released structured dataset for Indian RTI administrative decisions. The dataset draws from two sources: 1,218 cases from a publicly available instruction-response corpus (with structured fields added through rule-based extraction), and 298 CIC decision PDFs collected directly from the Commission portal, spanning five commissioners and three document format generations from 2023 to 2026. Label coverage reaches 89% on the instruction-response corpus. For the PDF subset of 239 primary decisions, coverage is 51% in this first release. A random sample of 50 labelled cases was manually reviewed, yielding a label precision of 95.3%. A zero-shot Mistral 7B baseline on 100 cases gives 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. RTI-Bench is available at https://huggingface.co/datasets/joyboseroy/rti-bench
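
To see why a majority-class baseline can score low on macro-F1 even when its accuracy looks reasonable, here is a minimal sketch on made-up labels. The label names and counts are illustrative assumptions, not the RTI-Bench label set or its class distribution.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical outcome labels with an imbalanced distribution.
y_true = ["allowed"] * 6 + ["dismissed"] * 3 + ["remanded"] * 1

# The majority-class baseline predicts the most frequent label for every case.
majority = Counter(y_true).most_common(1)[0][0]
y_majority = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
baseline_macro_f1 = macro_f1(y_true, y_majority)
```

In this toy setup the baseline reaches 0.6 accuracy but only 0.25 macro-F1, because every class it never predicts contributes an F1 of zero to the average.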

2605.16842 2026-05-19 cs.AI

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu

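AI summary: This paper proposes HT-GRPO, a hierarchical reinforcement learning method that uses a Sketch-Then-Paint training scheme and a hierarchical credit assignment mechanism to address key problems in optimizing diffusion multimodal large language models with reinforcement learning, improving image quality and aesthetics.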

Abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

2605.16839 2026-05-19 cs.CL

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Jiwon Song, Dongwon Jo, Beomseok Kang, Jae-Joon Kim

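AI summary: This paper proposes CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. It treats 2D block-sparse masks as KV-selection signals rather than kernel execution plans, maintaining accuracy while delivering up to 2.72× attention speedup.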

Abstract

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72× attention speedup at 128K context length under chunked prefill.
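
The two union steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's kernel: the nested-list mask layout `mask[head][q_block][kv_block]`, the function name, and the assumption that consecutive query heads share one KV head are all choices made for this example.

```python
def block_union_tables(mask, group_size):
    """Build one KV block table per GQA group from a 2D block-sparse mask.

    mask[h][q][k] is truthy if query head h, Q block q selects KV block k.
    Assumes (for this sketch) that heads g*group_size .. (g+1)*group_size-1 form group g.
    """
    num_heads = len(mask)
    if num_heads % group_size != 0:
        raise ValueError("number of heads must be divisible by group_size")
    tables = []
    for g in range(num_heads // group_size):
        selected = set()
        # Intra-group union: merge the selections of every head in the group.
        for h in range(g * group_size, (g + 1) * group_size):
            # Q-block union: merge the selections of every Q block of this head.
            for q_row in mask[h]:
                selected.update(k for k, keep in enumerate(q_row) if keep)
        tables.append(sorted(selected))
    return tables

# Toy mask: 4 query heads, 2 Q blocks per head, 5 KV blocks.
mask = [
    [[1, 0, 0, 0, 1], [1, 0, 1, 0, 1]],  # head 0
    [[1, 0, 0, 0, 1], [1, 0, 0, 0, 1]],  # head 1
    [[1, 1, 0, 0, 1], [1, 0, 0, 0, 1]],  # head 2
    [[1, 0, 0, 1, 1], [1, 0, 0, 1, 1]],  # head 3
]
tables = block_union_tables(mask, group_size=2)
```

Each table lists every KV block that any head or Q block in the group selected, so no selected block is dropped, and the blocks are referenced by index rather than copied.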

2605.16834 2026-05-19 cs.CV cs.AI cs.LG

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

Shiwon Kim, Yu Rang Park

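AI summary: This paper proposes a method based on relative representations for fine-grained multimodal alignment with limited data. By learning token-level cross-modal structure, it improves zero-shot classification, cross-modal retrieval, and zero-shot segmentation.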

Abstract

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
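
The idea of a relative representation can be sketched as follows: each token is described by its similarities to a set of anchors in its own modality space, so tokens from spaces of different dimensionality end up in a shared anchor-indexed space. This sketch uses fixed anchors and cosine similarity as assumptions; in the paper the anchors are learned, and the exact similarity function is not stated in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(tokens, anchors):
    """Map each token embedding to its vector of similarities to the anchors."""
    return [[cosine(t, a) for a in anchors] for t in tokens]

# Toy example: 3-d image patch tokens and 2-d text tokens, two anchors per modality.
image_tokens = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
image_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_tokens = [[3.0, 0.0]]
text_anchors = [[1.0, 0.0], [0.0, 1.0]]

image_rel = relative_representation(image_tokens, image_anchors)
text_rel = relative_representation(text_tokens, text_anchors)
```

Both outputs have one coordinate per anchor, so an image patch and a text token can be compared directly even though their original embeddings have different sizes.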

2605.16832 2026-05-19 cs.CV

Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction

Shuliang Zhu, Tomiwa Adey, Jinjia Zhou

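AI summary: This paper proposes an interface-preserving semantic augmentation for LLM-conditioned structured decoding. It associates semantic evidence with the point-cloud representation and encodes it as an RGBB point interface to improve structured indoor prediction, particularly opening localization and per-instance furniture detection in cluttered scenes.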

Abstract

Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
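
The RGBB code described above amounts to a four-entry lookup table whose values are appended to each point's raw attributes. In this minimal sketch the group-to-color mapping follows the abstract, while the 0.0 to 1.0 channel scale, the tuple layout of a point, and the function name are assumptions.

```python
# Four semantic groups encoded in three RGB channels (no fourth channel).
SEMANTIC_COLORS = {
    "furniture": (1.0, 0.0, 0.0),  # red
    "walls": (0.0, 1.0, 0.0),      # green
    "openings": (0.0, 0.0, 1.0),   # blue
    "others": (0.0, 0.0, 0.0),     # black
}

def append_semantic_color(point, group):
    """Return the raw point attributes followed by the three-channel semantic color."""
    return tuple(point) + SEMANTIC_COLORS[group]

# Hypothetical point with xyz coordinates only, labeled as part of a door or window.
augmented = append_semantic_color((0.5, 1.0, 2.0), "openings")
```

Because the color is just three extra per-point attributes, the same tokenization path can consume geometry and semantics together, which matches the abstract's claim that the decoder and output format stay unchanged.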

2605.16829 2026-05-19 cs.CL cs.PL

Constrained Code Generation with Discrete Diffusion

Lize Shao, Michael Cardei, Zichen Xie, Ferdinando Fioretto, Wenxi Wang

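AI summary: This paper proposes CDC, a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process, improving constraint satisfaction for functional correctness, security, and syntax in code generation.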

Abstract

Discrete diffusion models are a powerful, emerging paradigm for code generation. They construct programs through iterative refinement of partially corrupted token sequences and enable parallel token refinement. Importantly, this paradigm exposes a global program state at each denoising step, which provides a natural intervention point for enforcing program-level functionality and security constraints, guiding the generation before the final code is committed. Building on this observation, the paper introduces Constrained Diffusion for Code (CDC), a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process. CDC augments the base discrete diffusion sampler with constraint-aware denoising operators that combine mathematical optimization with program analysis to identify constraint-relevant regions of the intermediate program state and locally adjust the denoising trajectory, steering generation toward feasible programs while remaining close to the base model. Across code generation benchmarks, CDC consistently improves constraint satisfaction in functional correctness, security, and even syntax, outperforming discrete diffusion and autoregressive baselines with less corrective computation and more localized edits.

2605.16827 2026-05-19 cs.AI

Voices in the Loop: Mapping Participatory AI

Rashid Mushkani

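AI summary: This paper studies participatory AI in public, civic, and humanitarian settings and presents an open repository and interactive atlas of participatory AI initiatives. Drawing on records from Magaña and Shilton's Trustworthy AI corpus plus additional audited cases, it reports patterns in geography, participation tiers, lifecycle loci, organizational form, verification status, and documentation gaps. It also shows how versioned releases, record-linked issue and annotation channels, schema feedback workflows, and redaction or restricted-disclosure requests operationalize a design and governance framework for participatory AI infrastructures.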

Comments Accepted to The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25–28, 2026, Montreal, QC, Canada

Abstract

Participatory approaches to artificial intelligence are increasingly documented across public, civic, and humanitarian settings, but evidence about how participation is organized remains fragmented. This paper reports on the construction of an open repository and interactive atlas of participatory AI initiatives, using records harmonized from Magaña and Shilton's Trustworthy AI corpus, and additional audited cases from research and practice. We contribute three elements. First, we specify a reproducible protocol for discovery, vetting, harmonization, geocoding, provenance tracking, and release-based publication of participatory AI records. Second, we report corpus-level patterns in geography, participation tiers, lifecycle loci, organizational form, verification status, and remaining documentation gaps. Documented initiatives remain concentrated in a small number of countries, while participation is most often coded at problem formulation, evaluation, and governance rather than model development or training. Third, we show how the atlas operationalizes a design and governance framework for participatory-by-default AI infrastructures through versioned releases, record-linked issue and annotation channels, schema feedback workflows, and redaction or restricted-disclosure requests. The atlas is intended to support comparative research, policy learning, and community scrutiny through a living inventory that can be updated, contested, and reused.

2605.16826 2026-05-19 cs.LG cs.AI cs.CL

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen

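AI summary: This paper examines how knowledge distillation couples KL direction with the prefix (trajectory) source. Decoupling the two axes yields four valid distillation objectives; experiments reveal tradeoffs involving KL direction, prefix source, and training length, and motivate two practical methods, KL mixing and an entropy-gated length curriculum.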

Comments Code available at https://github.com/EIT-NLP/Decoupled-Distill

Abstract

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
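
The two token-level KL directions, and one possible way to mix them, can be sketched for a single next-token distribution. The convex combination with weight `alpha` is an assumption about what "KL mixing" could look like; the abstract does not specify the form, and the probabilities below are made up. Which model generated the prefix (teacher or student) is a separate choice that this snippet does not model.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_token_kl(teacher, student, alpha):
    """alpha * forward KL(teacher || student) + (1 - alpha) * reverse KL(student || teacher)."""
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)

# Made-up next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

forward = kl(teacher, student)   # SFT-style direction per the abstract
reverse = kl(student, teacher)   # RL-style direction per the abstract
mixed = mixed_token_kl(teacher, student, alpha=0.5)
```

Setting `alpha` to 1.0 or 0.0 recovers the pure forward or pure reverse objective, so the weight gives a single knob along the KL-direction axis.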

2605.16824 2026-05-19 cs.LG cs.CL

Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning

Shuo Liu, Ding Liu, Shi-Ju Ran

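AI summary: This paper studies the relationship between the correctness of LLM reasoning and token-level confidence trajectories, uses confidence geometry to separate correct from incorrect reasoning traces, and shows that the NeuralConf estimator improves confidence-weighted answer aggregation.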

Comments 11 pages, 9 figures, 1 table. Code is available at https://github.com/QML-TGU/NeuralConf

Abstract

Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies-Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
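
The difference between majority voting and confidence-weighted aggregation can be shown on a toy set of sampled traces. The per-trace scores below are made-up placeholders standing in for whatever a correctness estimator would output; this is not NeuralConf itself, and summing scores per answer is an assumed aggregation rule.

```python
from collections import defaultdict

def majority_vote(answers):
    """Return the answer that appears most often."""
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def weighted_vote(answers, scores):
    """Return the answer with the largest total score across traces."""
    totals = defaultdict(float)
    for a, s in zip(answers, scores):
        totals[a] += s
    return max(totals, key=totals.get)

# Three hypothetical traces: two low-scoring traces agree, one high-scoring trace differs.
answers = ["12", "12", "15"]
scores = [0.2, 0.3, 0.9]

by_count = majority_vote(answers)
by_score = weighted_vote(answers, scores)
```

Here the count-based vote picks "12" while the score-weighted vote picks "15", which is the kind of disagreement a good per-trace correctness score is meant to resolve in favor of the right answer.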

2605.16821 2026-05-19 cs.AI

Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

Xiaohua Wang, Chao Han, Kai Yu, XiaoLiang Xu, Liang Wang

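AI summary: Through a systematic analysis of agent interaction paradigms in the buddyMe framework (Generator-Evaluator, ReAct loop, and adversarial evaluation), this paper identifies design challenges and optimization strategies for multi-paradigm agent systems in practice.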

Comments 11 pages, 7 tables

Abstract

The rapid evolution of Large Language Model (LLM) agents has produced diverse interaction paradigms, yet few production systems integrate multiple paradigms within a unified architecture. This paper presents a systematic analysis of three principal agent interaction paradigms, including Multi-Agent Orchestration (Generator-Evaluator), ReAct Tool-Use Loops, and Memory-Augmented Interaction, as implemented in buddyMe, an open-source multi-model agent programming framework. We formalize a five-stage processing pipeline: Requirement Pre-Review -> Task Decomposition -> ReAct Execution -> Real-Execution Verification -> Adversarial Evaluation Discussion, and establish a six-dimensional evaluation schema with weighted scoring. Through four empirical case studies drawn from real-world deployment logs covering museum guide generation, scheduled weather tasks, and comprehensive tour planning, we draw three key conclusions. First, Generator-Evaluator pre-review detects requirement omissions in 20 percent of complex tasks, with 80 percent of tasks passing initial inspection. Second, the ReAct loop ensures stable subtask execution but leads to around 30 percent redundant tool invocations. Third, adversarial Evaluator-Defender discussions reach consensus within 2-3 rounds for nearly 70 percent of scenarios, functioning mainly for content refinement rather than logical reversal. We additionally provide three Mermaid-based architectural diagrams and conduct cross-paradigm comparisons with CrewAI, AutoGen, LangGraph, MemGPT and A-Mem across six system dimensions. The research outcomes offer practical design guidelines for constructing stable and reliable multi-paradigm agent systems.

2605.16819 2026-05-19 cs.CL cs.AI cs.LG

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

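AI summary: This paper presents AgentKernelArena, an open-source benchmark for evaluating GPU kernel optimization agents. Using isolated workspaces and centralized scoring, it tests agent performance and generalization across tasks and hardware targets, finding near-perfect compilation and high correctness rates on most task categories but substantial correctness drops on PyTorch-to-HIP translation under unseen configurations.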

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring, and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
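
One way to read "gated" checks is that a speedup only counts once compilation and correctness have both passed. The sketch below illustrates that reading with made-up timings; the function names, the tuple layout, and the rule of averaging only over passing runs are assumptions, not the benchmark's actual scoring code.

```python
def gated_speedup(compiled, correct, baseline_ms, optimized_ms):
    """Speedup over the baseline, or None if the compilation or correctness gate fails."""
    if not (compiled and correct):
        return None
    return baseline_ms / optimized_ms

def mean_speedup(results):
    """Mean speedup over the runs that passed both gates; None if none passed."""
    passed = [s for s in (gated_speedup(*r) for r in results) if s is not None]
    return sum(passed) / len(passed) if passed else None

# Made-up runs: (compiled, correct, baseline_ms, optimized_ms)
seen = [(True, True, 10.0, 5.0), (True, True, 8.0, 2.0)]
# On an unseen input shape the second kernel is fast but produces wrong output.
unseen = [(True, True, 12.0, 6.0), (True, False, 9.0, 1.0)]

seen_mean = mean_speedup(seen)
unseen_mean = mean_speedup(unseen)
```

The second unseen run would look like a 9x speedup if timing were checked first; gating on correctness keeps a kernel with hardcoded shape assumptions from inflating the score.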

2605.16818 2026-05-19 cs.CV cs.AI

Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions

基于观测对齐的遮罩先验学习物理动态的遮罩方法

Chiyuan Ma, Zihan Zhou, Tianshu Yu

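AI summary This paper proposes Observation-Aligned Mask Priors, which learn the distribution of authentic observation masks and use it to build context-query partitions for training physical-dynamics models on incomplete data. A Bayesian Flow Network is pretrained on binary masks, and its sampling is guided by a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation, which avoids zero-query dead zones and local generative collapse.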

详情
AI中文摘要

直接从不完整观测中学习物理动态具有挑战性,因为真实的遮罩是结构化的、样本依赖的,并且常常不是随机缺失的,而现有方法通常依赖启发式遮罩规则或预定义的遮罩分布。我们提出Observation-Aligned Mask Priors框架,该框架学习真实的观测遮罩分布,并利用其构建上下文-查询分区以从不完整数据中训练。具体来说,我们先在二进制观测遮罩上预训练一个贝叶斯流网络(BFN)以捕捉真实的遮罩拓扑结构,然后通过全局归一化交叉熵目标引导BFN采样,生成与每个稀疏观测对齐的样本特定遮罩。遮罩与观测遮罩的交集定义为上下文,剩余的观测条目成为扩散模型的查询目标。我们证明,这种基于交集的分区使每个有效的观测维度都有严格正的概率被查询,防止零查询死区和局部生成崩溃。在三个具有真实卫星遮罩的现实世界海洋学数据集上,跨分辨率至256×256的实验显示,在MSE和PSNR上优于强扩散基线的一致改进。这些结果表明,从真实遮罩中学习遮罩先验是学习不完整物理观测的有效替代方法,无需访问完全观测的场数据。

英文摘要

Learning physical dynamics directly from incomplete observations is challenging because authentic occlusions are structured, sample-dependent, and often missing not at random, whereas existing methods typically rely on heuristic masking rules or predefined mask distributions. We propose Observation-Aligned Mask Priors, a framework that learns the distribution of authentic observation masks and uses it to construct context-query partitions for training from incomplete data. Specifically, we pretrain a Bayesian Flow Network (BFN) on binary observation masks to capture real occlusion topologies, then guide BFN sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation. The intersection between the guided mask and the observed mask defines the context, and the remaining observed entries become query targets for a diffusion-based reconstruction model. We show that this intersection-based partitioning gives every valid observed dimension a strictly positive probability of being queried, preventing zero-query dead zones and local generative collapse. Experiments on three real-world oceanographic datasets with authentic satellite occlusions, across resolutions up to 256×256, show consistent improvements over strong diffusion baselines in MSE and PSNR. These results demonstrate that learning mask priors from authentic occlusions is an effective alternative to heuristic masking for learning from incomplete physical observations without access to fully observed fields.
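
The intersection-based partition described in the abstract can be illustrated with a few lines of plain Python. This is a minimal sketch that assumes masks are flat lists of 0/1 flags where 1 marks an observed (or kept) entry; the function name `partition` is hypothetical and the BFN sampling step is not modeled.

```python
def partition(observed, guided):
    """Split observed entries into context and query sets.

    observed, guided: equal-length lists of 0/1 flags (assumed encoding).
    context = observed AND guided; query = observed AND NOT guided.
    """
    if len(observed) != len(guided):
        raise ValueError("masks must have the same length")
    context = [o & g for o, g in zip(observed, guided)]
    query = [o & (1 - g) for o, g in zip(observed, guided)]
    return context, query
```

By construction the context and query sets are disjoint and together cover exactly the observed entries, so unobserved positions are never used as targets.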

2605.16810 2026-05-19 cs.CV

Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending

无需训练的遮挡文本渲染:通过字形先验和注意力引导的语义融合

Jingqi Hou, Hongtian Wang

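AI summary This paper proposes a training-free framework for occluded text rendering built on the pretrained FLUX.1-dev model. It addresses occluder placement and text-structure stability with dual-stream inference and glyph priors, improving text readability while keeping occlusion alignment competitive.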

Comments 9 pages, 3 figures, 3 tables

详情
AI中文摘要

我们提出一种无需训练的遮挡文本渲染框架,使用预训练的FLUX.1-dev主干网络。该任务要求模型生成可识别的字体并放置遮挡物在预期文本区域。现有文本到图像生成器在这一设置中仍然具有挑战性:遮挡物往往远离文本,而文本可能被扭曲或漂浮在遮挡物之上。为了解决这个问题,我们提出了一个重启双流推理框架,将文本布局保持与遮挡物插入解耦。基流提供干净的字形参考和相同步骤的键/值(K/V)特征,而编辑流则基于遮挡提示进行条件化。我们进一步采用来自FreeText的光谱字形先验思想,并将其适应于早期到中期去噪过程中稳定目标文本结构。在推理过程中,我们的方法局部化目标文本,从令牌条件化的注意力和字形支持中估计文本带区域,并推导出一个锚点感知的硬融合掩码用于遮挡物。在最终的编辑过程中,生成从相同的初始噪声开始,并在选定的注意力站点应用硬掩码引导的图像-令牌K/V替换,保持基流布局在掩码外,同时在掩码内注入来自编辑流的遮挡物外观。在代表性遮挡文本场景的实验中,显著提高了文本可读性,并在遮挡对齐方面具有竞争力,从而在不进行模型微调的情况下实现了更稳定的物体-文本组合。

英文摘要

We present a training-free framework for occluded text rendering with a pretrained FLUX.1-dev backbone. The task requires a model to render recognizable typography and place an occluding object over the intended text region. This setting remains difficult for existing text-to-image generators: the occluder often drifts away from the text, while the text may be distorted or appear to float on top of the occluding object. To address this problem, we propose a restarted dual-stream inference framework that decouples text-layout preservation from occluder insertion. A Base Stream provides a clean typographic reference and same-step key/value (K/V) features, while the Edit Stream is conditioned on the occlusion prompt. We further adopt the spectral glyph-prior idea from FreeText and adapt it to stabilize the target text structure during early-to-mid denoising. In the reasoning pass, our method localizes the target text, estimates a text-band region from token-conditioned attention and glyph support, and derives an anchor-aware hard fusion mask for the occluder. In the final edit pass, generation restarts from the same initial noise and applies hard mask-guided image-token K/V replacement at selected attention sites, preserving the Base layout outside the mask while injecting the occluder appearance from the Edit Stream inside the mask. Experiments on representative occluded text scenarios demonstrate substantially improved text readability and competitive occlusion alignment, yielding more stable object-on-text compositions without any model fine-tuning.

2605.16809 2026-05-19 cs.LG

Informative Graph Structure Learning

信息导向的图结构学习

Shen Han, Zhiyao Zhou, Jiawei Chen, Sheng Zhou, Canghong Jin, Hai Lin, Da Zhong Li, Bingde Hu, Can Wang

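AI summary This paper proposes Informative Graph Structure Learning (InGSL), which combines similarity and diversity when constructing edges, reducing the number of edges while improving performance.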

详情
AI中文摘要

图结构数据的质量对现代图分析技术如图神经网络(GNNs)的成功至关重要。然而,现实中的图数据往往质量不佳,存在噪声和连接不完整等问题。图结构学习(GSL)作为一种适应性优化节点连接的技术已崭露头角。然而,我们发现GSL的效果常常以边数大幅增加为代价,导致存储和计算开销显著增加。在本工作中,我们揭示这一限制源于广泛使用的基于相似性的边构造方法,该方法主要基于嵌入连接高度相似的邻居,引入了大量结构冗余。为了解决这一问题,我们提出了一种新颖的信息导向图结构学习方法(InGSL),通过引入互信息引导的学习策略,同时考虑相似性和多样性进行边构造。值得注意的是,InGSL作为一种可插拔模块,能够无缝集成到现有的GSL框架中。通过在六个代表性GSL方法上的广泛实验,我们证明InGSL在减少边数的同时实现了显著的性能提升。

英文摘要

The quality of graph-structured data is fundamental to the success of modern graph analysis techniques such as Graph Neural Networks (GNNs). However, real-world graph data is often suboptimal, suffering from issues such as noise and incomplete connections. Graph Structure Learning (GSL) has emerged as a promising technique that adaptively optimizes node connections. Yet we observe that the effectiveness of GSL often comes at the cost of a dramatic expansion in edge count, resulting in significant storage and computational overhead. In this work, we reveal that this limitation stems from the prevalent use of similarity-based edge construction, which predominantly connects highly similar neighbors based on their embeddings, introducing substantial structural redundancy. To address this, we propose a novel Informative Graph Structure Learning method (InGSL), which jointly considers both similarity and diversity in edge construction by incorporating a mutual-information-guided learning strategy. Notably, InGSL serves as a plug-in module that can be seamlessly integrated into existing GSL frameworks. Through extensive experiments on six representative GSL methods, we demonstrate that InGSL achieves significant performance improvements with fewer edges.

2605.16807 2026-05-19 cs.CV

DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion

DecoRec: 通过物体级扩散进行单视图图像的分解3D场景重建

Yuhan Ping, Yuan Liu, Xiaoxiao Long, Peng Wang, Junhui Hou, Jianyi Zheng, Jia Pan, Xin Li, Cheng Lin

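AI summary This paper presents DecoRec, a system that uses object-level diffusion to reconstruct a decomposed 3D scene from a single-view image. It addresses the inaccuracy of existing reconstruction methods and refines the result with differentiable rendering and diffusion-guided refinement.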

详情
AI中文摘要

在本文中,我们介绍了DecoRec,一种新的系统,旨在将单视图2D图像提升为分解的3D场景网格。当前单视图场景重建方法通常依赖于物体检索或粗粒度3D体素或表面的回归,导致无法准确捕捉输入图像的外观和几何结构。缺乏高质量的大规模场景级数据集进一步加剧了从单视图图像直接生成3D场景的难度。为了实现高质量的3D场景生成,DecoRec利用最近的基于扩散的单视图物体重建方法,分别重建单个物体。随后提出一个细化流程,通过可微渲染技术和扩散引导细化技术有效地将这些重建的物体合并,提升外观和几何结构。我们的结果表明,DecoRec在几何和新合成方面实现了高质量的单视图场景重建,为下游应用如房间内部设计提供了显著的便利。

英文摘要

In this paper, we introduce DecoRec, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.

2605.16806 2026-05-19 cs.LG cs.AI cs.CV

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

跨模态亲和对齐的多模态学习分析用于预测基于游戏的学习中学生协作满意度

Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung

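AI summary This paper proposes a cross-modal affinity-aligned multimodal learning analytics framework that models inter-modal relationships and applies contrastive learning to make predictions of student collaboration satisfaction more robust and interpretable.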

Comments Accepted by CVPR 2026 CVxEdu Workshop

详情
AI中文摘要

协作式基于游戏的学习环境为小组知识构建提供了丰富的机遇,但自动预测学生协作满意度仍具挑战性。关键障碍是模态退化:在教育部署中,个体模态如眼动在学生群体间表现出不一致的信息量,导致基于隐式注意力的融合产生脆弱的多模态表示。我们提出了亲和对齐多模态学习分析(AAMLA)框架,其核心贡献是跨模态亲和引导的模态对齐(CAMA)模块,该模块通过亲和矩阵显式建模模态间关系,并通过对比学习强制跨模态一致性,从而实现对无信息模态的自适应抑制而不丢弃它们。AAMLA进一步应用模态特定的投影层,将异构特征,包括面部动作单元、头部姿态、眼动和交互痕迹日志,映射到统一的语义空间,然后再进行对齐。在EcoJourneys协作学习环境中的50名中学生实验表明,在标准和模态退化条件下,AAMLA在单模态基线和先前跨注意力方法上均表现出一致的改进,SHAP和t-SNE分析证实CAMA能够产生稳健且可解释的跨模态表示,用于学生协作建模。

英文摘要

Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.

2605.16805 2026-05-19 cs.CV

NeuroLiDAR: Adaptive Frame Rate Depth Sensing via Neuromorphic Event-LiDAR Fusion

NeuroLiDAR: 通过神经形态事件-LiDAR融合实现自适应帧率深度感知

Darshana Rathnayake, Dulanga Weerakoon, Meera Radhakrishnan, Archan Misra

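AI summary This paper presents NeuroLiDAR, which fuses temporally sparse LiDAR data with temporally dense data from neuromorphic event cameras to achieve adaptive-frame-rate depth sensing at up to about 66 Hz, reducing depth reconstruction error (RMSE) by about 29%.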

Comments Accepted at ICRA 2026

详情
AI中文摘要

LiDARs被广泛用于3D深度重建,但其性能常受到固有硬件限制的制约,这些限制在范围、空间分辨率和帧率之间产生权衡。许多LiDAR系统通常以低帧率(例如5-10Hz)运行,优先考虑远距离传感而不是对快速场景变化的响应。我们提出了NeuroLiDAR,一种能够实现高达约66Hz有效帧率的自适应深度感知框架,通过融合时间稀疏的LiDAR数据与时间密集的神经形态事件相机数据。NeuroLiDAR集成了两个组件:基于事件的关键帧检测和基于事件的深度外推,以动态调整感知速率以响应场景动态。为了评估我们的方法,我们引入了ELiDAR数据集,涵盖了户外和室内场景,并展示了NeuroLiDAR在RMSE中将深度重建误差减少了约29%,同时实现了27.8-47.3Hz的自适应帧率。我们的代码和数据集可在https://github.com/darshanakgr/neurolidar上获得。

英文摘要

LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to ≈66 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components, event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by ≈29% in RMSE while achieving adaptive frame rates between 27.8 and 47.3 Hz. Our code and dataset are available at https://github.com/darshanakgr/neurolidar.

2605.16800 2026-05-19 cs.LG cs.CL

FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

FIM-LoRA: 通过校准时间梯度方差估计实现任务信息的秩分配

Ramakrishnan Sathyavageeswaran

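AI summary This paper proposes FIM-LoRA, which estimates gradient variance at calibration time and uses it to allocate LoRA rank across layers according to how informative each layer is for the task, matching uniform-rank LoRA at the same parameter budget.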

Comments 10 pages, 1 figure

详情
AI中文摘要

低秩适应(LoRA)为每个适应的权重矩阵分配一个统一的秩——一种实用的便利,但忽略了一个基本现实:不同层对任务适应的贡献不均。我们通过一种轻量级的工程解决方案来解决这个问题:在微调开始之前,运行八次校准反向传递,计算每个LoRA-B矩阵的梯度方差作为层信息度的代理,并按比例重新分配秩预算。所得到的适配器是一个标准的LoRA,具有每层的秩模式——没有新的参数,没有训练开销,没有对服务基础设施的更改。我们通过高效地近似经验 Fisher 信息矩阵(eFIM)对角线,仅限于 LoRA 适配器矩阵,来实现这一点,这将内存成本降低了大约256倍相比完整的模型 Fisher 估计。在 GLUE 上使用 DeBERTa-v3-base 时,FIM-LoRA 在相同参数预算下与 LoRA 相当(88.6 vs. 88.7),在常识推理上使用 LLaMA-3-8B 时达到 68.5 vs. 68.7。每层的秩映射是可解释的:值投影和早期到中期层一致获得更高的秩,与已建立的 transformer 层角色研究结果一致。

英文摘要

Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.
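
The calibration-then-redistribute idea in the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: gradients are given as flattened lists, the layer score is the mean per-element variance across passes, every layer keeps a hypothetical minimum rank of 1, and rounding uses a largest-remainder rule chosen here for simplicity.

```python
def gradient_variance(grads):
    """Mean per-element variance across calibration passes.

    grads: list of flattened gradient vectors, one per backward pass.
    """
    n = len(grads)
    dim = len(grads[0])
    total = 0.0
    for j in range(dim):
        column = [g[j] for g in grads]
        mean = sum(column) / n
        total += sum((x - mean) ** 2 for x in column) / n
    return total / dim


def allocate_ranks(scores, budget, min_rank=1):
    """Split an integer rank budget across layers in proportion to scores."""
    names = list(scores)
    if budget < min_rank * len(names):
        raise ValueError("budget too small for the minimum rank")
    spare = budget - min_rank * len(names)
    total = sum(scores.values())
    if total <= 0:
        shares = {n: spare / len(names) for n in names}  # fall back to uniform
    else:
        shares = {n: spare * scores[n] / total for n in names}
    ranks = {n: min_rank + int(shares[n]) for n in names}
    leftover = budget - sum(ranks.values())
    # Hand the remaining units to the layers with the largest fractional share.
    by_fraction = sorted(names, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for n in by_fraction[:leftover]:
        ranks[n] += 1
    return ranks
```

For example, a budget of 24 over three layers with scores 1, 1, and 6 keeps the total at 24 while giving most of the rank to the high-variance layer.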

2605.16797 2026-05-19 cs.CV cs.RO

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

EgoKit: 向统一低成本第一人称视角数据采集迈进:异构设备

Liuchuan Yu, Erdem Murat, Beichen Wang, Yan Zeng, Tingting Luo, Huizhen Zhou, Shanghao Li, Huining Feng, Zhigen Zhao, Ning Yang, Ke Jing, Yunhao Liu, Ruoya Sheng

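AI summary This paper presents EgoKit, a toolkit that unifies egocentric data collection across six heterogeneous host devices. It addresses SDK differences and inconsistent capture across devices, and provides a uniform log format and, on XR headsets, hand-tracking data.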

详情
AI中文摘要

第一人称视角视频越来越多地被用作机器人学习、活动理解及具身AI研究的数据源,但大规模采集仍然碎片化:每个候选主机设备,如Android手机、iPhone、iPad、智能眼镜或扩展现实(XR)头戴设备,都暴露了不同的SDK,对原始摄像机访问有不同的政策,以及对外部USB摄像机和设备内跟踪有不同的限制。因此,同步第一人称视角和腕部视角的采集通常通过要么承诺单一专有平台或构建一次性装置来实现,这些装置无法跨设备转移。为了解决这一差距,我们提出了EgoKit,一种工具包,它在六个异构主机设备上暴露相同的第一人称视角录制流程。在所有支持的设备上,EgoKit提供相同的录制交互,并产生本地存储的视频,具有统一的日志格式;在XR头戴设备上,它还记录头部姿态和符合OpenXR标准的26关节手部追踪,与视频流对齐。配套的配件,包括两个带有支架的腕部摄像机、一个头带和一个USB-C集线器,使任何支持的主机都能添加腕部视角捕获,而无需定制硬件制造。EgoKit可在 https://egokit.chuange.org/ 上获得。

英文摘要

Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.

2605.16795 2026-05-19 cs.CV cs.AI cs.GR

3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation

3DPhysVideo: 通过3D场景重建和物理模拟的一致性引导流SDE用于视频生成

Hwidong Kim, Yunho Kim, Tae-Kyun Kim

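AI summary This paper proposes a training-free pipeline that generates physically realistic videos through 3D scene reconstruction and physical simulation. Consistency-Guided Flow SDE decomposes the predicted velocity to enforce consistency with the conditional inputs, bridging the gap from a single image to physically plausible video in multi-object and fluid-interaction scenes.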

Comments Project page: https://hwidong-kim.github.io/projects/3DPhysVideo

详情
AI中文摘要

视频生成模型取得了显著进展,但它们常常产生违反物理动态基础的视觉伪影。最近的工作如PhysGen3D通过网格重建和基于物理的渲染处理单张图像到3D物理,但在建模流体动力学、多物体交互和照片级真实感方面仍存在挑战。本文介绍了3DPhysVideo,一种新颖的无训练管道,能够从单张图像生成物理真实的视频。我们重新利用现成的视频模型进行两个阶段。首先,我们将其用作新的视图合成器,通过引导图像到视频(I2V)流模型使用渲染点云来重建完整的360度3D场景几何。其次,在应用物理求解器到此几何后,物理模拟的点云用于引导相同的I2V流模型以合成最终的高质量视频。一致性引导流SDE将I2V流模型预测的速度分解为去噪和一致性偏差,强制条件输入的一致性,使我们能够有效地重新利用模型进行3D重建和模拟引导的视频生成。在包括多物体和流体交互场景在内的多样化实验中,我们的方法成功地从单张图像过渡到物理合理的视频,同时在单个消费级GPU上运行高效。它在GPT基线得分、VideoPhy基准和人类评估中优于最先进的基线。

英文摘要

Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions, and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency with the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In diverse experiments, including multi-object and fluid-interaction scenes, our method successfully bridges the gap from single images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.

2605.16790 2026-05-19 cs.LG cs.AI cs.CL

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

TIER: 用于多步工具组合的轨迹不变执行奖励

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

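AI summary This paper proposes TIER, a reward framework grounded in function schemas and runtime execution that provides dense, interpretable sequence-level feedback, supports multiple solution strategies, adapts to evolving tool interfaces, and achieves high accuracy on benchmarks such as DepthBench.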

Comments Preprint. Submitted to NeurIPS 2026. 28 pages, 7 figures, 8 tables. Code and datasets available at https://github.com/anaykulkarni/TIER

详情
AI中文摘要

工具使用使大语言模型能够通过一系列API调用解决复杂任务,但现有的强化学习方法无法扩展到多步骤组合设置。基于结果的奖励只能提供稀疏反馈,而轨迹监督的奖励依赖于注释的参考解决方案,惩罚有效的替代方案并限制可扩展性。我们提出TIER:轨迹不变执行奖励,一种奖励框架,其监督直接来自函数模式和运行时执行,而非参考轨迹。该奖励分解为格式有效性、模式遵守、执行成功和答案正确性,提供来自细粒度验证的单个步骤工具使用反馈。这种设计允许任何有效的执行路径获得信用,自然支持多种解决方案策略并适应变化的工具接口。在DepthBench,一个按深度(1到6步)分层的组合基准上,TIER在所有步骤中实现了>90%的准确率,其中轨迹监督的奖励在第4步之后崩溃。我们进一步在BFCL v3和NestFUL等基准上展示了持续的提升。消融研究确认所有奖励组件都是必要的,突显了多级监督对于组合推理的重要性。

英文摘要

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.
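
The four-part reward named in the abstract (format validity, schema adherence, execution success, answer correctness) can be sketched as a chain of gates. Everything concrete below is an assumption made for illustration: the JSON call format, a schema reduced to a list of required argument names, the 0 to 3 point scale per step, and the equal weighting of step score and answer correctness.

```python
import json


def score_step(raw_call, schemas, tools):
    """Return (points, output) for one tool call; points counts passed gates (0-3)."""
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (ValueError, KeyError, TypeError):
        return 0, None  # not a well-formed call
    required = schemas.get(name)
    if required is None or not isinstance(args, dict) or not set(required) <= set(args):
        return 1, None  # well-formed but violates the (simplified) schema
    try:
        output = tools[name](**args)
    except Exception:
        return 2, None  # schema-valid but failed at runtime
    return 3, output


def trajectory_reward(raw_calls, schemas, tools, expected_answer):
    """Combine per-step scores with final-answer correctness, scaled to [0, 1]."""
    if not raw_calls:
        return 0.0
    points, last_output = 0, None
    for raw in raw_calls:
        step_points, last_output = score_step(raw, schemas, tools)
        points += step_points
    step_part = points / (3 * len(raw_calls))
    answer_part = 1.0 if last_output == expected_answer else 0.0
    return 0.5 * step_part + 0.5 * answer_part
```

No reference trajectory appears anywhere in this sketch, so two different valid call sequences that reach the same answer receive the same reward.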

2605.16789 2026-05-19 cs.CV

Accelerating Rectified Flow Models via Trajectory-Aware Caching

通过轨迹感知缓存加速修正流模型

Xiao Liu, Kai Liu, Naiyang Guan, Hongliang Lu, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

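AI summary This paper proposes TACache, a training-free trajectory-aware caching framework that accelerates rectified flow models for high-fidelity image and video generation by decomposing the change in the velocity field and compensating for the approximation error of skipped steps.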

Comments 22 pages, 14 figures

详情
AI中文摘要

扩散和修正流(RF)模型能够生成高保真的图像和视频,但其迭代的速度场评估计算成本高。现有的缓存方法通过跳过时间步来加速采样,但其粗略的近似会导致在长间隔跳过时积累误差并降低质量。我们提出TACache(轨迹感知缓存),一种训练自由的加速框架,遵循跳过然后补偿的范式。TACache将离散的速度加速度沿RF轨迹进行正交分解,将其分为平行分量和正交残差,隔离每一步近似误差的幅度和方向源。该框架分为两个阶段:离线阶段,累积变化阈值在幅度和方向指标上产生跳过计划并限制每个跳过间隔可延伸的距离;在线阶段,每个跳过的步骤中,离线统计数据与样本的历史正交方向结合,重建跳过的速度而无需额外的模型评估。在BAGEL、FLUX.1-dev和Wan2.1-1.3B上的实验表明,TACache在文本到图像生成中实现了高达4.14倍的速度提升,在文本到视频生成中实现了2.11倍的速度提升,并在所有参考保真度度量上优于先前的基于缓存的方法。代码将很快发布。

英文摘要

Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample's historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to 4.14x speedup on text-to-image generation and 2.11x speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.
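
The split of a discrete velocity change into a parallel component and an orthogonal residual can be shown with a standard vector projection. This is a minimal sketch that assumes the reference direction is the previous step's velocity; the abstract does not spell out that choice, and the function name is hypothetical.

```python
def decompose_acceleration(v_prev, v_curr):
    """Split (v_curr - v_prev) into parts parallel and orthogonal to v_prev."""
    accel = [c - p for p, c in zip(v_prev, v_curr)]
    norm_sq = sum(p * p for p in v_prev)
    if norm_sq == 0.0:
        # No reference direction: treat the whole change as orthogonal residual.
        return [0.0] * len(accel), accel
    coeff = sum(a * p for a, p in zip(accel, v_prev)) / norm_sq
    parallel = [coeff * p for p in v_prev]
    orthogonal = [a - q for a, q in zip(accel, parallel)]
    return parallel, orthogonal
```

The parallel part captures how much the velocity grew or shrank along its previous direction, while the residual captures how much it turned.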

2605.16787 2026-05-19 cs.LG cs.CL

The Unlearnability Phenomenon in RLVR for Language Models

在语言模型中RLVR的不可学习现象

Yulin Chen, He He, Chen Zhao

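AI summary This paper studies the learning dynamics of RLVR for improving LLM reasoning and finds that some hard examples remain unlearnable even when correct rollouts are present, revealing fundamental limitations of current RL methods on reasoning tasks.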

Comments Accepted to ICML 2026

详情
AI中文摘要

可验证奖励强化学习(RLVR)已被证明在提高大语言模型(LLM)的推理能力方面是有效的。然而,RLVR的学习动态仍缺乏深入研究。在本文中,我们揭示了一个反直觉的现象:在模型最初难以处理的硬例中,一个显著子集即使在存在正确回放的情况下仍无法学习。为了理解这一现象,我们首先证明了现有的优化和采样技术无法解决不可学习性。通过跨例梯度分析,我们显示不可学习的例子具有根本性的表示问题,其特征是与其余例子的梯度相似性低且推理模式不可泛化。我们进一步表明,表示缺陷在RL中难以缓解,因为数据增强无法提高梯度相似性。本研究为RLVR训练中的不可学习数据提供了首次系统的表征,并揭示了当前RL方法在推理任务中的根本限制。代码和数据可在https://github.com/yulinchen99/unlearnability-rlvr获取。

英文摘要

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.

2605.16786 2026-05-19 cs.LG

Lever: Speculative LLM Inference on Smartphones

Lever:智能手机上的推测LLM推理

Tuowei Wang, Fengzu Li, Yanfan Sun, Wei Gao, Ju Ren

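AI summary This paper presents Lever, a system that jointly optimizes the three stages of speculative decoding to enable efficient flash-backed LLM inference on smartphones, substantially reducing inference latency.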

详情
AI中文摘要

大型语言模型(LLMs)在交互式移动应用中需求日益增加,但高质量模型超出了智能手机上有限的DRAM容量。闪存可以容纳更大的模型,但闪存支持的推理速度慢,因为自回归解码反复调用目标模型并产生昂贵的I/O。我们观察到推测解码非常适合这种环境:一个小型草稿模型可以保留在DRAM中,而一个更大的驻留于闪存的目标模型在每次调用中验证多个候选令牌。然而,现有方法假设服务器级加速器,并未考虑长时间I/O延迟、有限的计算并行性和不规则的推测执行。我们提出了Lever,一个用于智能手机上高效闪存支持LLM推理的端到端系统。Lever在移动约束下联合优化推测解码的三个阶段。在草稿阶段,它使用I/O和计算感知的增益-成本目标构建令牌树。在验证阶段,它通过早期退出预测修剪低价值分支以减少目标模型计算。在执行阶段,它将推测高效地映射到移动CPU-NPU硬件以提高利用率。全面评估显示,Lever将推理延迟降低了2.93倍于基准闪存卸载推理,1.50倍于传统推测解码,缩小了闪存支持与内存驻留LLM推理之间的延迟差距。

英文摘要

Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
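
The reason speculative decoding reduces target-model invocations can be seen in a toy greedy verification loop. This sketch covers only a linear draft, not the token trees, early-exit pruning, or CPU-NPU mapping described in the abstract; `target_next` is a hypothetical stand-in for the flash-resident target model.

```python
def verify_draft(prefix, draft, target_next):
    """Greedy verification: keep draft tokens while the target model agrees.

    target_next(tokens) returns the target model's next token for that context.
    A real system scores all draft positions in one target forward pass; the
    per-position calls here are only for readability.
    """
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

Each verification step yields at least one token, and several when the draft model guesses well, which is what amortizes the costly I/O of loading the target model.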

2605.16785 2026-05-19 cs.CV cs.AI

Encoding Robust Topological Signatures for Hyperdimensional Computing

为超维计算编码鲁棒的拓扑特征

Arpan Kusari

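AI summary This paper proposes a topology-based approach to hyperdimensional computing that extracts discrete topological primitives and pairs them with RTS-invariant shape signatures, improving robustness to perturbations such as rotation, noise, and occlusion. Experiments on MNIST and EMNIST show that it outperforms a naive HD baseline.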

详情
AI中文摘要

超维(HD)计算由于其简单性、快速的原型基推断和与在线更新的兼容性,为边缘学习提供了一个有吸引力的替代方案。然而,标准的基于像素的HD编码器容易受到分布偏移的影响,如旋转、噪声或遮挡,会显著降低准确性。我们从二值化形状中提取离散拓扑原始特征——尤其是孔洞,并将它们与旋转/平移/缩放(RTS)不变的形状签名配对。我们的方法为(i)外轮廓使用空间金字塔变体的Zernike矩构建RTS稳定的描述符,(ii)每个孔洞使用其径向签名的内在傅里叶描述符以及RTS-标准相对几何。每个原始特征通过随机投影和角色绑定映射到双极超向量,并通过排列不变的捆绑聚合变量卡数的孔洞集以形成单个图像超向量。为了避免过度加权任何线索,我们通过在验证集上融合余弦相似度学习Zernike和孔洞通道的非负可靠性权重。在MNIST和EMNIST数据集上进行的实验表明,拓扑引导的HD计算相比传统HD基线显著提高了鲁棒性,保持了多个扰动家族的高精度,并受益于轻量级在线训练。与在干净数据上训练的紧凑CNN相比,我们的方法在清洁精度上具有竞争力,同时对几种像素级扰动具有明显更强的鲁棒性,证明了显式拓扑结构是实现鲁棒HD表示的可行途径。代码在https://github.com/arpan-kusari/Topological-HDC提供。

英文摘要

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives, most notably holes, from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.
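
Role binding, permutation-invariant bundling, and cosine comparison of bipolar hypervectors can be illustrated with generic HD operations. This is a sketch of the textbook operations only: the dimension of 2048, the tie-breaking rule, and the use of random hypervectors in place of projected Zernike or Fourier descriptors are all assumptions.

```python
import math
import random

DIM = 2048  # assumed hypervector dimension for this sketch


def random_hv(rng, dim=DIM):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Bind two hypervectors by elementwise multiplication (self-inverse)."""
    return [x * y for x, y in zip(a, b)]


def bundle(hvs):
    """Aggregate a set by elementwise majority vote (ties go to +1).

    The result does not depend on the order of hvs.
    """
    return [1 if sum(column) >= 0 else -1 for column in zip(*hvs)]


def cosine(a, b):
    """Cosine similarity between two hypervectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
```

Because bundling is a per-element vote, a variable number of hole hypervectors can be merged into one fixed-size vector regardless of the order in which the holes were found.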

2605.16779 2026-05-19 cs.CV cs.AI

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

一种基于无监督聚类分析的超二次曲面拟合整体方法

Mingyang Zhao, Sipu Ruan, Xiaohong Jia

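AI summary This paper presents a new method for fitting superquadrics to point clouds contaminated by noise and outliers. It reformulates the problem as unsupervised clustering, fits rigid and deformable superquadrics within one framework, and provides closed-form analytical solutions and a convergence proof.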

Comments 20 pages, Code: https://github.com/zikai1/SuperquadricFitting

详情
Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
AI中文摘要

本文提出了一种新的方法,用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合,该方法在多个领域具有广泛的应用。与以往仅专注于拟合刚性或变形超二次曲面或存在鲁棒性和数值稳定性问题的方法不同,我们的方法从无监督聚类的新视角重新定义问题,使刚性和变形超二次曲面的拟合能够在统一的框架中完成。我们的方法核心是一种受无监督聚类分析启发的稳定优化函数,其中我们将点云数据和潜在参数曲面的样本分别作为聚类成员和质心。然后,具有动态更新质心位置的聚类过程成为优化超二次曲面参数的直接代理,建立了几何拟合与聚类动态之间的原则性联系。我们进一步推导了聚类质心与聚类成员之间的成对计算与正交距离之间的关系,从而有效消除了耗时的曲面采样过程。此外,我们的公式为模糊成员度向量和协方差矩阵提供了闭式解析解,确保了高效迭代优化,并能够更有效地处理几何变形。此外,我们还提供了收敛性分析的理论证明,并证明了聚类启发的拟合方法通过内在增加目标函数的凸性来逃避局部极小值。实现已公开在https://github.com/zikai1/SuperquadricFitting。

英文摘要

This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between the pairwise computations over clustering centroids and members and the orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iterative optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.

2605.16776 2026-05-19 cs.LG cs.AI

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

可区分删除:统一知识擦除与拒绝用于大语言模型去学习

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

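AI summary This paper proposes the D^2 method, which erases undesirable knowledge by restricting the response distribution in the latent representation while distinguishing it from retained knowledge, enabling a safe and coherent refusal mechanism and improving the effectiveness of large language model unlearning.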

Comments ICML2026 Accepted

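Details
AI-generated abstract (translated from Chinese)

Mitigating sensitive and harmful outputs is essential to the safe deployment of large language models (LLMs). Existing methods typically follow one of two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers the model away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning suffers from biased deletion because it suppresses specific token sequences as a substitute for complete knowledge removal, while DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these problems, we propose Distinguishable Deletion (D^2), a paradigm that erases undesirable knowledge by restricting the response distribution in the latent representation while distinguishing it from retained knowledge, so that unlearned inputs can be handled safely and coherently. To implement D^2, we introduce an energy index that quantifies both the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that the energy is both accurate and efficient, allowing Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and to apply an energy-based refusal mechanism at inference. Extensive experiments show that EUA significantly outperforms prior methods, indicating the superiority of D^2. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.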

English abstract

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
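
The abstract does not define the energy index. A common energy score in the energy-based-model literature is the negative log-sum-exp of a model's logits. Assuming an index of roughly that form, the sketch below shows how a threshold on it could gate a refusal at inference. The function names, the threshold value, the direction of the comparison, and the toy logits are all hypothetical and are not taken from the paper or its repository.

```python
import math


def energy(logits, temperature=1.0):
    """Negative log-sum-exp of the logits (a common energy score, assumed here)."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    return -temperature * (m + math.log(sum(math.exp(v - m) for v in scaled)))


def should_refuse(logits, threshold):
    """Hypothetical gate: refuse when the energy exceeds the threshold."""
    return energy(logits) > threshold


confident = [10.0, 0.0, 0.0]  # toy logits with one dominant entry (low energy)
flat = [0.0, 0.0, 0.0]        # toy logits with no dominant entry (higher energy)

refuse_flat = should_refuse(flat, threshold=-5.0)
refuse_confident = should_refuse(confident, threshold=-5.0)
```

Under this assumed convention, training would push unlearned content toward high energy and retained content toward low energy, so that a single threshold separates the two at inference.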