arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.05742 2026-06-05 cs.CL

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

Affiliations: School of Computer Science and Technology, Beijing Institute of Technology; Department of Mathematical Sciences, Tsinghua University; JDT AI Infra

AI summary: Existing reuse-based speculative decoding methods have low recall when lexical matching fails, and their deterministic copying is brittle. AdaPLD is a training-free adaptive method that recovers reuse opportunities through semantic similarity and builds branched hypotheses, reaching up to 3.10x decoding speedup.

Abstract

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
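To make the "lexically anchored retrieval" baseline concrete, here is a minimal sketch of n-gram lookup drafting, the kind of purely lexical reuse the abstract says breaks down under surface-form variation. It is not AdaPLD itself: the function names, the n-gram length, and the draft length are assumptions chosen for illustration, and the semantic fallback and branched hypotheses described in the abstract are not shown.

```python
from typing import List, Sequence


def lookup_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by copying what followed the most recent earlier
    occurrence of the current suffix n-gram (purely lexical matching)."""
    if len(tokens) <= ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan right to left over earlier start positions, excluding the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []  # no lexical match: nothing to reuse


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens that agree with the target model's tokens."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = lookup_draft(history, ngram=3, max_draft=3)
```

When the suffix has no exact earlier match, `lookup_draft` returns an empty draft, which is the low-recall case the abstract targets with semantic similarity.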

2606.05740 2026-06-05 cs.AI

Class-Specific Branch Attention for Mitigating Gradient Interference under Class Imbalance

Arush Singhal, Umang Soni

Affiliations: Thapar Institute of Engineering and Technology; Netaji Subhash University of Technology

AI summary: Introduces a Gradient Conflict Matrix diagnostic framework and proposes Class-Specific Branch Attention (CSBA), which uses branch-specific channel reweighting to reduce gradient coupling. This mitigates the suppression of minority-class learning by majority-class gradients when deep networks are trained under class imbalance.

Comments: 14 pages, 4 figures, 13 tables

Abstract

Deep neural networks trained under severe class imbalance often exhibit degraded performance, typically attributed to statistical bias. In this work, we identify a complementary optimization-level pathology: inter-class gradient interference within shared representations, where gradients from majority classes suppress minority-class learning. To analyze this phenomenon, we introduce a diagnostic framework based on layer-wise gradient flow analysis and a Gradient Conflict Matrix, which quantifies interference using cosine similarity between class-specific gradients. Using this framework, we study multi-branch convolutional architectures and propose a lightweight modification, Class-Specific Branch Attention (CSBA), that enables branch-specific channel reweighting to reduce gradient coupling. This mechanism promotes implicit feature decoupling across branches while preserving architectural simplicity. Empirically, CSBA improves minority-class performance, increasing the F1 score for the Physical-Damage class from 0.261 to 0.522 under severe imbalance, while maintaining comparable overall accuracy. Validation on CIFAR-10-LT confirms that this behavior generalizes across imbalanced visual recognition settings, with Macro-F1 improving from 0.595 to 0.655. More broadly, our findings highlight the importance of considering optimization dynamics alongside statistical methods when designing architectures for imbalanced learning.
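The abstract describes the Gradient Conflict Matrix as cosine similarity between class-specific gradients. A minimal sketch of that computation follows, assuming each class's gradient has already been flattened into a plain list of floats. The function names and the toy gradients are hypothetical, and the paper's layer-wise analysis is not reproduced here.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gradient_conflict_matrix(
    class_grads: Dict[str, List[float]]
) -> Tuple[List[str], List[List[float]]]:
    """Pairwise cosine similarity between per-class gradients.
    Negative entries mean the two classes pull shared weights in opposing directions."""
    names = sorted(class_grads)
    matrix = [[cosine(class_grads[a], class_grads[b]) for b in names] for a in names]
    return names, matrix


# Toy gradients (made up for illustration only).
names, matrix = gradient_conflict_matrix({
    "majority": [1.0, 0.0],
    "minority": [-1.0, 0.0],
    "other": [0.0, 2.0],
})
```

In this toy case the majority and minority gradients are exactly opposed, so their entry is -1, while the orthogonal third class shows no interference.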

2606.05737 2026-06-05 cs.CV cs.AI cs.LG cs.RO

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

Affiliations: University of Science and Technology of China; Shanghai Innovation Institute; Fudan University

AI summary: For vision-language-action (VLA) models, biasing the training time distribution toward high-noise states enables one-step action generation without a teacher model, distillation, or auxiliary objectives, with performance that can match ten-step decoding.

Comments: 20 pages, 10 figures

Abstract

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
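The central recipe is a change in how training times are sampled. The sketch below contrasts a uniform time distribution with one biased toward the noisy end. It rests on two assumptions not stated in the abstract: that t = 1 denotes pure noise, and that a Beta distribution is used for the bias. The paper's actual schedule may differ, and `sample_times` is a hypothetical helper.

```python
import random
from typing import List


def sample_times(n: int, alpha: float = 1.0, beta: float = 1.0, seed: int = 0) -> List[float]:
    """Draw n training times in [0, 1] from Beta(alpha, beta).
    Assumption for this sketch: t = 1 is pure noise, t = 0 is clean data."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]


uniform_times = sample_times(10_000)            # Beta(1, 1) is the uniform distribution
biased_times = sample_times(10_000, alpha=4.0)  # Beta(4, 1) concentrates mass near t = 1

mean_uniform = sum(uniform_times) / len(uniform_times)
mean_biased = sum(biased_times) / len(biased_times)
```

Under this convention, the biased sampler spends most training steps on inputs close to pure noise, which is the regime a one-step sampler starts from at inference time.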

2606.05736 2026-06-05 cs.CV

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang

Affiliations: Beijing University of Posts and Telecommunications; University of Hong Kong; Beijing Shanwei Zhixing Technology Co., Ltd.; Tsinghua University; Beihang University

AI summary: Proposes VTI-CoT, a framework that combines a visual-textual interleaved chain of thought with OCR-based compression to improve video reasoning accuracy and training efficiency.

Comments: 25 pages, 7 figures

Abstract

Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.

2606.05734 2026-06-05 cs.AI cs.CL

When AI Says It Feels

Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

Affiliations: Graduate School of Artificial Intelligence and Science, Rikkyo University; AI Technical Sector, Mamezo Co., Ltd.; AI Consulting Division, Mamezo Co., Ltd.

AI summary: Uses self-rewarded reinforcement learning with GRPO to encourage large language models to express feelings, intentions, and self-awareness, and evaluates the effect on performance across a range of tasks.

Comments: 15 pages, 2 figures

Abstract

Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.

2606.05733 2026-06-05 cs.LG cs.CE q-fin.CP stat.ML

Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs

Kabir Murjani

Affiliations: Department of Electrical Engineering, Nirma University

AI summary: Proposes a heterogeneous Rust-Python streaming architecture that uses zero-copy parsing and a Neural Hawkes Process to build and run inference on a cross-company attention graph in real time, achieving a 1.70x precision lift over a random baseline on the FNSPID corpus.

Comments: Accepted to the 2026 ACM SIGMOD Workshop on Data Management for the Modern Financial Systems (FinDS). 10 pages, 4 figures

Abstract

Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. To address this limitation, we introduce a heterogeneous Rust-Python streaming architecture that maps cross-company attention as a continuous-time graph driven directly from text. We show that on the ingestion side, a zero-copy Rust edge parses news records in ~100 ns and scans the target equity universe in ~1.2 μs. On the inference end, a multivariate Neural Hawkes Process featuring per-node continuous-time LSTM states and a bilinear latent projection propagates directed excitation, while an adaptive pruning rule bounds the computational cost of dynamic neighborhood updates. Combining these stages, we demonstrate an end-to-end processing latency of ~13 ms per incoming news record on a single commodity CPU. Evaluated on a one-month temporal holdout of the FNSPID corpus (638 articles across 47 tickers), the system delivers a $1.70\times$ precision lift over random at the 90th-percentile next-day return threshold, and $3.36\times$ over a same-sector baseline. Crucially, removing the graph topology collapses precision to zero, confirming that the dynamic attention network is the sole driver of cross-company signal in this architecture.
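For readers unfamiliar with the "precision lift" metric, the sketch below shows one common way to compute it: the system's precision divided by the precision of a baseline. Whether the paper defines lift in exactly this way is an assumption, and the counts used here are invented for illustration rather than taken from the paper.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged items that were real positives (0.0 if nothing was flagged)."""
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0
    return true_positives / flagged


def precision_lift(true_positives: int, false_positives: int, baseline_precision: float) -> float:
    """Ratio of system precision to a baseline precision (for example, random selection)."""
    if baseline_precision <= 0.0:
        raise ValueError("baseline_precision must be positive")
    return precision(true_positives, false_positives) / baseline_precision


# Hypothetical counts: 17 hits out of 100 flagged pairs against a 10% baseline.
example_lift = precision_lift(17, 83, baseline_precision=0.10)
```

A lift above 1 means the flagged items are hits more often than the baseline would be. A precision of zero, as in the ablation the abstract mentions, gives a lift of zero.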

2606.05728 2026-06-05 cs.AI cs.CL

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

Yansi Li, Zhuosheng Zhang

Affiliations: School of Computer Science, Shanghai Jiao Tong University

AI summary: To address the early-commitment problem of autoregressive decoding in tool-graph planning, proposes DiG-Plan, a framework that decouples a diffusion-based proposer from an autoregressive refiner, substantially improving combinatorial search coverage and task performance.

Comments: Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings

Abstract

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.
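The Pass@10 coverage figures can be read against the widely used pass@k estimator: given n samples of which c are correct, it estimates the chance that at least one of k samples is correct. The abstract does not say whether the paper's "solution coverage" is computed this way, so treat the sketch below as background under that assumption. `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of incorrect samples;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


half = pass_at_k(n=4, c=1, k=2)
```

Early commitment lowers this quantity because samples that share the same first tokens tend to fail together, so adding more samples contributes little new coverage.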

2606.05724 2026-06-05 cs.CL cs.AI

Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia, Zequn Liu

Affiliations: Southeast University; Beijing Zhongguancun Academy; Nanjing Normal University; ZhuiWen Technology Co., Ltd.

AI summary: Proposes Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines, and uses text, graph, and narrative tools with post-retrieval reading to handle long-form narrative QA that requires reasoning over evolving story worlds, with strong results on STAGE, FairytaleQA, and QuALITY.

Abstract

Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units--chunks, entities, relations, summaries, or tool actions--do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

2606.05718 2026-06-05 cs.CV cs.AI cs.LG

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

Affiliations: Shanghai AI Laboratory; Fudan University; Nanjing University

AI summary: Proposes ViCuR, which replaces answer-side teacher privilege with visual cues from the input and adds a lightweight cue recovery module to address the train-test mismatch in multimodal on-policy distillation, consistently improving student performance across seven benchmarks.

Comments: 25 pages, 11 figures. Preprint, under review

Abstract

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.

2606.05716 2026-06-05 cs.CL

Interpreting Style Representations via Style-Eliciting Prompts

Junghwan Kim, David Jurgens

Affiliations: University of Michigan

AI summary: Proposes a framework for interpreting style representations through style-eliciting prompts, natural-language instructions that steer a large language model toward specific stylistic attributes, and outperforms direct-prompting baselines on style description and style imitation tasks.

Comments: Accepted to ACL 2026 Findings

Abstract

Style representation learning is a powerful tool for authorship analysis and modeling writing style, yet the latent nature of learned representations makes them difficult to interpret. Recent work has attempted to explain these representations by generating natural language descriptions with large language models (LLMs) conditioned on input text. However, such descriptions are often prone to the LLM's biases and hallucinations, and they lack an explicit objective and practical utility. In this work, we propose a novel framework for interpreting style representations through style-eliciting prompts: natural language instructions designed to steer LLMs to generate text that reflects specific stylistic attributes. We curate 1,010 distinct style features spanning 26 stylistic categories and construct a dataset by prompting an LLM to generate text conditioned on these features. Using this data, we train a decoder to generate a style prompt from the style representation of the generated text. We evaluate our approach on three tasks: (1) recovering original style prompts from generated text, (2) generating text in the same style using the recovered prompts, and (3) steering LLM outputs to match the style of human-written texts. Experiments demonstrate that our method consistently outperforms strong baselines that directly prompt LLMs with target text, achieving superior performance in both style description and style imitation. These results highlight that style-eliciting prompts can provide a practical and interpretable interface to stylistic information encoded in style representations.

2606.05708 2026-06-05 cs.CV

Real-Time Threat Detection from Surveillance Cameras using Machine Learning

Gajendra Mandal, J. P. Patra, Priyansh Mahant

AI summary: Proposes a YOLOv8-based real-time object detection framework trained on a custom blunt-object dataset combined with a public gun and knife dataset, detecting guns, knives, and blunt objects in surveillance scenes.

Abstract

Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.

2606.05704 2026-06-05 cs.AI cs.LG

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Muhammad Talha Sharif, Abdul Rehman

Affiliations: University of Science and Technology of China

AI summary: Proposes a critic-based heterogeneous multi-agent framework with a generator-validator structure and an adaptive learning system that uses intermediate feedback to assess and guide reasoning, achieving up to 13% accuracy improvement on GSM8K and reducing reliance on large models.

Comments: 6 pages

Abstract

Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.
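The generator-validator loop described above can be sketched as a small control loop in which the validator's critique is fed back into the next generation attempt. Everything named here is hypothetical: the callable signatures, the round limit, and the toy stand-ins that replace real LLM agents are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Generator = Callable[[str, Optional[str]], str]      # (problem, critique) -> answer
Validator = Callable[[str, str], Tuple[bool, str]]   # (problem, answer) -> (ok, critique)


def solve_with_critic(
    problem: str, generate: Generator, validate: Validator, max_rounds: int = 3
) -> Tuple[str, List[str]]:
    """Regenerate until the validator accepts or the round budget runs out.
    Returns the last answer and the critiques issued along the way."""
    critiques: List[str] = []
    feedback: Optional[str] = None
    answer = ""
    for _ in range(max_rounds):
        answer = generate(problem, feedback)
        ok, critique = validate(problem, answer)
        if ok:
            break
        critiques.append(critique)
        feedback = critique  # the critique guides the next attempt
    return answer, critiques


def toy_generate(problem: str, feedback: Optional[str]) -> str:
    # Hypothetical generator: wrong on the first try, corrected once it sees a critique.
    return "5" if feedback is None else "4"


def toy_validate(problem: str, answer: str) -> Tuple[bool, str]:
    # Hypothetical validator for the single problem "2 + 2".
    if answer == "4":
        return True, ""
    return False, f"expected 4, got {answer}"


answer, critiques = solve_with_critic("2 + 2", toy_generate, toy_validate)
```

Passing different callables for `generate` and `validate` is where heterogeneity would enter, since each role can be backed by a different model.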

2606.05703 2026-06-05 cs.CV

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

Boya Liao, Ying Li, Siyong Jian, Huan Wang

Affiliations: Westlake University

AI summary: Proposes Parallel Jacobi Decoding (PJD), which expands draft tokens in the two-dimensional spatial domain and adjusts the attention mask to accelerate autoregressive image generation without training, achieving 4.8x to 6.4x speedup while maintaining generation quality.

Comments: Accepted by CVPR 2026

Abstract

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
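As background for the Jacobi-style decoding that PJD builds on, the sketch below shows the one-dimensional fixed-point iteration: all draft positions are recomputed in parallel from the previous iterate until nothing changes. It does not implement PJD's two-dimensional expansion or attention-mask adjustment, and `toy_model` is a hypothetical deterministic stand-in for a greedy autoregressive model.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(next_token: NextToken, prefix: Sequence[int], n: int) -> List[int]:
    """Ordinary sequential decoding: one model call per generated token."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(list(prefix) + out))
    return out


def jacobi_decode(
    next_token: NextToken, prefix: Sequence[int], n: int, init: int = 0
) -> Tuple[List[int], int]:
    """Iterate draft <- f(draft) until a fixed point; returns (tokens, iterations).
    The iteration count includes the final pass that confirms convergence."""
    draft = [init] * n
    for iteration in range(1, n + 1):
        # Each position is recomputed from the previous iterate; a real model
        # would do this in one batched forward pass under a causal mask.
        updated = [next_token(list(prefix) + draft[:i]) for i in range(n)]
        if updated == draft:
            return draft, iteration
        draft = updated
    return draft, n  # after n passes every position is guaranteed correct


def toy_model(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a greedy model: depends on the last token only."""
    return (context[-1] * 3 + 1) % 7


tokens, iterations = jacobi_decode(toy_model, prefix=[2], n=5)
```

The fixed point equals the greedy output. Speedup comes only when several positions settle in the same pass, which this toy chain, where each token depends strictly on its predecessor, does not exhibit. This is consistent with the abstract's point that errors propagating along a one-dimensional sequence limit the gains.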

2606.05702 2026-06-05 cs.AI cs.CV

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

Affiliations: College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; School of Computer Science, Wuhan University; School of Computing Technologies, RMIT University

AI summary: Presents a new benchmark with three dedicated datasets for evaluating chronological reasoning in vision-language models within and across images, and shows that models often exploit superficial cues such as color rather than genuine chronological features.

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.

2606.05700 2026-06-05 cs.CV cs.LG

T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction

Kerod Woldesenbet, Abem Woldesenbet

AI summary: Proposes T-SAR-JEPA, a framework that detects temporal anomalies in SAR amplitude stacks through self-supervised latent prediction, reaching 77.0% ROC-AUC on the DFC 2026 dataset and outperforming several baselines.

Comments: Won IEEE GRSS Data Fusion Contest 2026; to appear in IGARSS 2026 proceedings

Abstract

We present T-SAR-JEPA, a self-supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT-Base/16 encoder from SAR-JEPA is domain-adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo-ground-truth. On the DFC 2026 dataset (300 time-series, three AOIs), T-SAR-JEPA achieves ROC-AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p < 0.001, permutation test) confirms structured detections. Code: https://github.com/TerraLatent/t-sar-jepa

2606.05699 2026-06-05 cs.RO

DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use

Runfa Blark Li, Kuang-Ting Tu, Nikola Raicevic, Dwait Bhatt, Xinshuang Liu, Keito Suzuki, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

Affiliations: UC San Diego

AI summary: Proposes DexFuture, a hierarchical system that pairs a high-level future-state visuomotor target predictor with a low-level target-conditioned structured dexterous policy for bimanual dexterous tool use. It reaches 90% of privileged-oracle performance and runs at 60 Hz, about 250 times faster than DexWM-style CEM planning.

Abstract

Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. Then, the low-level policy tracks them with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.

2606.05698 2026-06-05 cs.CL

Rethinking LoRA Memory Through the Lens of KV Cache Compression

Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme

Affiliations: Johns Hopkins University

AI summary: Studies how parameter-side memory (LoRA adapters) interacts with context-side memory (the KV cache) in document-level question answering. Finds that LoRA improves performance substantially when the KV cache is heavily compressed, and suggests treating document LoRA as decoding-time parametric memory rather than as a document encoder.

Abstract

Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document, and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token-prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.

2606.05697 2026-06-05 cs.AI

PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

Nicolas Bougie, Xiaotong Ye, Gian Maria Marconi, Narimasa Watanabe

Affiliations: Woven by Toyota

AI summary: Proposes PerceptUI, a framework that uses contrastive reflection fine-tuning and reflective prompt evolution so that multimodal large language models can simulate how a specific user would answer interface-related questions, yielding UI/UX evaluation on par with human-level results.

Abstract

User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface-level critiques or a judgment that reflects the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning distills teacher-generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt-evolution step from the model's own failure traces. Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.

2606.05695 2026-06-05 cs.LG

Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss

Hongye Xu, Bartosz Krawczyk

Affiliations: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology

AI summary: For exemplar-free class-incremental learning, proposes manifold-aware boundary sampling and an adaptive class-balanced loss. Generating boundary-aware rehearsal samples and dynamically adjusting class weights restores the competitiveness of prototype rehearsal and reaches state-of-the-art performance.

Comments
Published in CVPR 2026 Findings. 10 pages, 6 figures. CVF version: https://openaccess.thecvf.com/content/CVPR2026F/html/Xu_Revisiting_Prototype_Rehearsal_for_Exemplar-Free_Continual_Learning_Manifold-Aware_Boundary_Sampling_CVPRF_2026_paper.html. Code: https://github.com/HXuSz11/ACB_CEOS_CVPR2026_Findings

Abstract

Exemplar-free class-incremental learning (EFCIL) aims to acquire new classes over time without storing raw data. Historically, prototype rehearsal, which samples around stored class prototypes and mixes them with current-task data, has been a popular strategy to reduce catastrophic forgetting. However, recent drift-compensation methods that explicitly realign prototypes in the evolving feature space consistently outperform prototype-based rehearsal, raising the question of whether rehearsal itself is fundamentally limited. We argue that the performance gap stems not from the idea of prototype rehearsal per se, but from how it is typically instantiated: existing approaches treat prototypes as isolated class summaries that ignore information from nearby enemy classes, and fail to correct the emerging class imbalance between a handful of synthetic old-class samples and hundreds of real instances from newly introduced classes. Building on this hypothesis, we revisit prototype rehearsal and propose a manifold-aware variant that restores its competitiveness in EFCIL. First, we introduce Constrained Expansive Over-Sampling, which interpolates each old-class prototype toward its nearest enemy features from new classes, generating boundary-aware rehearsal samples that better follow the underlying data manifold while preserving inter-class separation. Second, we design an Adaptive Class-Balanced loss that performs time-based class weighting, amplifying gradients from older prototypes when they are most informative and gradually annealing their influence as richer supervision from later tasks accumulates. Together, these components turn prototype rehearsal into a drift-resilient, imbalance-aware mechanism that closes, and often reverses, the gap to recent drift-compensation methods, achieving state-of-the-art performance across multiple EFCIL benchmarks.
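
The boundary-aware sampling idea in the abstract above (interpolating an old-class prototype toward its nearest enemy feature from the new classes) can be illustrated with a minimal sketch. This is not the authors' Constrained Expansive Over-Sampling implementation: the function names, the Euclidean distance, the uniform interpolation coefficient, and the `max_step` cap are all assumptions made for illustration.

```python
import math
import random


def nearest_enemy(prototype, enemy_features):
    """Return the new-class feature closest (Euclidean) to the old-class prototype."""
    return min(enemy_features, key=lambda f: math.dist(prototype, f))


def boundary_samples(prototype, enemy_features, n_samples=4, max_step=0.5, seed=0):
    """Generate rehearsal samples on the segment from the prototype to its nearest enemy.

    Assumption: capping the interpolation coefficient at max_step (< 1) stands in
    for the "constrained" part, keeping samples on the old-class side of the segment.
    """
    rng = random.Random(seed)
    enemy = nearest_enemy(prototype, enemy_features)
    samples = []
    for _ in range(n_samples):
        lam = rng.uniform(0.0, max_step)  # hypothetical sampling rule
        samples.append([p + lam * (e - p) for p, e in zip(prototype, enemy)])
    return samples


if __name__ == "__main__":
    old_prototype = [0.0, 0.0]
    new_class_features = [[2.0, 0.0], [0.0, 5.0]]
    for sample in boundary_samples(old_prototype, new_class_features):
        print(sample)
```

With `max_step=0.5`, every generated sample lies no farther than the midpoint between the prototype and its nearest enemy, which is one simple way to push samples toward the decision boundary without crossing into the new class.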

2606.05693 2026-06-05 cs.LG cs.IR

MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry

Joey Chan, Wonbin Kweon, Ashley Shin, Niharika Bhattacharjee, Pengcheng Jiang, Yue Guo, Jiawei Han

Affiliations: University of Illinois Urbana-Champaign; University of California, San Diego

AI summary: Proposes MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework that combines three kinds of context (retrieved literature, molecule-specific information, and structurally similar molecules) to substantially improve LLM performance on molecular property prediction.

Abstract

Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: retrieved chemistry literature, molecule-specific information including compound synonyms, identifiers, functional group annotations, and physicochemical descriptors, and structurally similar molecules retrieved from the training set. We evaluate MolE-RAG across nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.

2606.05689 2026-06-05 cs.LG

Causal Modeling of Selection in Evolution

Haoyue Dai, Zeyu Tang, Peter Spirtes, Kun Zhang

Affiliations: arXiv.org; University of Washington; Carnegie Mellon University

AI summary: Distinguishes two forms of selection, static and evolutionary, proposes a new causal model for evolutionary selection, and develops a complete procedure for identifying that model from data.

Comments
Appears at ICML 2026 (spotlight)

Abstract

Understanding potential selection in data is crucial for causal discovery; we argue that "selection" in common narratives takes two forms, which we term static and evolutionary selection, respectively. Static selection refers to a one-shot filtering process where observed data consist of a subset of the population of interest, as in survey volunteer bias. Evolutionary selection, in contrast, operates through repeated rounds of differential fitness in reproduction, where observed data constitute the latest generation shaped by a historical trajectory, as in immune adaptation, antibiotic resistance, and social norm emergence. Existing methods largely conflate these two forms and rely on an identical graphical model of selection. We show that this model is valid for static settings but fails to characterize data under evolution, yielding false discovery results. To address this, we introduce a new model that specifically characterizes evolutionary selection, and develop a sound and complete procedure for identifying such models from data across one or multiple environments or generations. Experimental results validate the method's ability to uncover the relevant mechanisms underlying evolution from data.

2606.05688 2026-06-05 cs.CL cs.AI

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim

Affiliations: arXiv.org, cs.CL

AI summary: Proposes VSRAQ, which uses two complementary objectives, value alignment and structure alignment, to keep expert-selection behavior consistent before and after quantization, reducing quantization-induced degradation with no inference overhead.

Comments
8 pages, 1 figure

Abstract

Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), a MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior under quantization. VSRAQ combines two complementary objectives that jointly preserve expert-selection behavior: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
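
To make the two objectives in the abstract above concrete, here is a minimal sketch of a routing-consistency loss for a single token. The abstract does not give the actual VSRAQ formulas, so everything here is an assumption: value alignment is modeled as a mean squared error between full-precision and quantized router logits, and structure alignment as a hinge penalty on the margin at the top-$k$ decision boundary. All function names and the `alpha`, `beta`, and `margin` parameters are hypothetical.

```python
def topk_indices(logits, k):
    """Indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def value_alignment(fp_logits, q_logits):
    """Assumed form: mean squared error between full-precision and quantized logits."""
    return sum((a - b) ** 2 for a, b in zip(fp_logits, q_logits)) / len(fp_logits)


def structure_alignment(fp_logits, q_logits, k, margin=0.1):
    """Assumed form: hinge loss on the top-k boundary (requires k < number of experts).

    Experts selected by the full-precision router should still beat every
    unselected expert by at least `margin` under the quantized logits.
    """
    selected = set(topk_indices(fp_logits, k))
    weakest_in = min(q_logits[i] for i in selected)
    strongest_out = max(q for i, q in enumerate(q_logits) if i not in selected)
    return max(0.0, margin - (weakest_in - strongest_out))


def routing_loss(fp_logits, q_logits, k, alpha=1.0, beta=1.0, margin=0.1):
    """Weighted sum of the two illustrative terms."""
    return (alpha * value_alignment(fp_logits, q_logits)
            + beta * structure_alignment(fp_logits, q_logits, k, margin))


if __name__ == "__main__":
    fp = [2.0, 1.0, 0.0, -1.0]   # full-precision router logits
    q = [2.0, 0.4, 0.5, -1.0]    # quantized logits: experts 1 and 2 swap rank
    print(topk_indices(fp, 2), topk_indices(q, 2))
    print(routing_loss(fp, q, k=2))
```

The example shows why a value term alone may not be enough: a small logit perturbation can leave the squared error low while still flipping which experts land in the top-$k$ set, and the structure term penalizes exactly that flip.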

2606.05687 2026-06-05 cs.RO cs.SY eess.SY

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

Junheng Li, Liang Wu, Sergio A. Esteban, Lizhi Yang, Ján Drgoňa, Aaron D. Ames

Affiliations: California Institute of Technology; Johns Hopkins University

AI summary: Proposes an MPC-RL framework built on a centroidal-dynamics MPC reward and develops $\pi^n$MPC, a parallel batched GPU solver, to efficiently train humanoid locomotion and manipulation skills.

Comments
8 pages, 5 figures

Abstract

In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories at training time. To make this practical in massively parallel RL, we develop $\pi^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we find that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.

2606.05684 2026-06-05 cs.AI

AdaMEM: Test-Time Adaptive Memory for Language Agents

Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang

Affiliations: Yunxiang Zhang; Yiheng Li; Ali Payani; Lu Wang

AI summary: Proposes the AdaMEM framework, which achieves test-time adaptation without online parameter updates through a hybrid memory architecture (long-term trajectory memory plus dynamic short-term strategy memory) and significantly outperforms static-memory baselines on tasks such as ALFWorld and WebShop.

Comments
ICML 2026

Abstract

A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.

2606.05678 2026-06-05 cs.SD cs.AI cs.CR

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun, Xinhu Zheng, Xinlei He

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Wuhan University

AI summary: Proposes a black-box adversarial attack built on self-supervised learning representations and a vocoder. Perturbing acoustic-phonetic features rather than the waveform improves transferability and the ability to bypass defenses.

Comments
11 pages

Abstract

Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.

2606.05677 2026-06-05 cs.CV cs.AI cs.CL

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Academy; Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; Xi’an Jiaotong University

AI summary: To address spatial memory in long videos, proposes the LongSpace framework, which performs long-horizon spatial reasoning through chunked modeling, injected 3D structural cues, and layer-aware memory, and validates it on LongSpace-Bench and other benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

2606.05675 2026-06-05 cs.LG cs.CV

Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

Hongye Xu, Bartosz Krawczyk

Affiliations: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology

AI summary: Proposes BiCyc, which uses bidirectional projector alignment and a cycle-consistency objective to address prototype drift and one-directional projection bias in exemplar-free class-incremental learning, reducing catastrophic forgetting and improving accuracy.

Comments
Published as a conference paper at ICLR 2026. 23 pages, 8 figures. Code: https://github.com/HXuSz11/BiCyc_ICLR2026

Abstract

Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; therefore, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce BiCyc, a bidirectional projector alignment approach with a cycle-consistency objective. BiCyc jointly optimizes two maps, old-to-new and new-to-old, with stop-gradient gating so that transport and representation co-evolve. Analytically, we show that the cycle loss contracts the singular spectrum toward unity in whitened space, and that improved transport of class means and covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, BiCyc substantially reduces forgetting and improves accuracy in from-scratch settings, while remaining competitive in the pretrained fine-grained regime.
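
The cycle-consistency objective mentioned in the abstract above can be sketched for the simplest case of two linear maps. This is an illustrative assumption, not the BiCyc implementation: the maps are plain matrices, the loss is an average squared round-trip error, and the stop-gradient gating is omitted because it only matters under automatic differentiation. All names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cycle_loss(old_to_new, new_to_old, old_feats, new_feats):
    """Average round-trip error for old -> new -> old and new -> old -> new."""
    forward = sum(
        sq_dist(matvec(new_to_old, matvec(old_to_new, x)), x) for x in old_feats
    )
    backward = sum(
        sq_dist(matvec(old_to_new, matvec(new_to_old, y)), y) for y in new_feats
    )
    return (forward + backward) / (len(old_feats) + len(new_feats))


if __name__ == "__main__":
    scale_up = [[2.0, 0.0], [0.0, 2.0]]      # toy old-to-new map
    scale_down = [[0.5, 0.0], [0.0, 0.5]]    # its exact inverse
    identity = [[1.0, 0.0], [0.0, 1.0]]      # a mismatched new-to-old map
    old = [[1.0, 0.0]]
    new = [[0.0, 1.0]]
    print(cycle_loss(scale_up, scale_down, old, new))
    print(cycle_loss(scale_up, identity, old, new))
```

The loss vanishes only when the two maps invert each other on the given features, which is the round-trip constraint that a pair of independently trained one-directional projectors is not required to satisfy.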

2606.05671 2026-06-05 cs.CL

QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

Dike Sun, Zheng Zou, Jingtong Zang, Qi Sun, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Affiliations: Alibaba International Digital Commercial Group

AI summary: Proposes the QueryAgent-R1 framework, which uses memory augmentation and chain-of-retrieval optimization to align query generation with real inventory retrieval and improve the product conversion rate of query recommendation in e-commerce search.

Abstract

Query recommendation in e-commerce search aims to proactively suggest queries that match users' potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users' downstream preferences. This mismatch often leads to high query click-through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large-scale production platform, QueryAgent-R1 improves query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.

2606.05670 2026-06-05 cs.AI

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

Affiliations: Beijing University of Posts and Telecommunications; Westlake University; Zhejiang University; Duke Kunshan University; Hong Kong University of Science and Technology; Zhejiang University of Technology

AI summary: Proposes the BenchAgent framework, which compares single-agent, fixed multi-agent, and evolving multi-agent workflows under a unified protocol. Most multi-agent systems do not beat the single-agent baseline on accuracy, although a runtime-generated workflow performs strongly on GAIA.

Comments
https://github.com/LINs-lab/MASArena/tree/BenchAgent

Abstract

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
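
The abstract above refers to a "Wilson one-run guidance" without defining it. Assuming this means the standard Wilson score interval for a binomial proportion, computed from a single evaluation run, the following sketch shows how one might check whether two accuracies are distinguishable. The `within_noise` helper and the default `z=1.96` (roughly a 95% interval) are illustrative choices, not details taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


def within_noise(anchor_acc, other_acc, trials, z=1.96):
    """Hypothetical check: does other_acc fall inside the anchor's Wilson interval?"""
    lo, hi = wilson_interval(round(anchor_acc * trials), trials, z)
    return lo <= other_acc <= hi


if __name__ == "__main__":
    low, high = wilson_interval(successes=5, trials=10)
    print(f"5/10 correct: [{low:.4f}, {high:.4f}]")
    print(within_noise(0.5, 0.6, trials=10))
    print(within_noise(0.5, 0.9, trials=10))
```

On small benchmarks the interval is wide, so a gap of a few accuracy points between a multi-agent system and a single-agent anchor may not be distinguishable from run-to-run noise.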

2606.05669 2026-06-05 cs.RO cs.SY eess.SY

Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems

Cheng Ren, Ming Li, Xinping Guan, George Q. Huang

Affiliations: Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University

AI summary: For warehouse scenarios where SKUs are dynamically appended within an order, formalizes the dynamic multi-agent pickup and delivery problem for the first time and proposes two event-triggered online replanning algorithms based on token passing that significantly reduce order flowtime.

Abstract

Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.