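arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.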

2606.00606 2026-06-02 cs.CV

FiSeR: Fine-Grained Source Representations for Cross-Domain AI Image Detection

Shan Zhang, Yongxin He, Mingming Zhang, Huiwen Tian, Lei Ma

AI summary: To address the poor generalization of synthetic image detectors under domain shift, the paper proposes FiSeR, a hierarchical contrastive learning framework that jointly optimizes coarse and fine contrastive objectives, yielding an average AUROC gain of +10.22 in cross-domain evaluation.

Abstract

Real-world synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. Our code is publicly available at: https://github.com/heyongxin233/FiSeR.
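The two-level objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a supervised-contrastive (SupCon-style) loss at both levels, unit-norm embeddings, and a simple weighted sum; the names `sup_con_loss`, `hierarchical_loss`, `temperature`, and `fine_weight` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sup_con_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss over unit-norm embeddings (assumed form, not from the paper).

    Anchors that have no same-label partner are skipped.
    """
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: dot(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a positive for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


def hierarchical_loss(embeddings, is_synthetic, generator_ids, fine_weight=1.0):
    """Coarse term: natural vs. synthetic. Fine term: generator identity among synthetic images."""
    coarse = sup_con_loss(embeddings, is_synthetic)
    synth = [i for i, flag in enumerate(is_synthetic) if flag]
    fine = sup_con_loss([embeddings[i] for i in synth],
                        [generator_ids[i] for i in synth])
    return coarse + fine_weight * fine
```

The fine term only sees synthetic samples, which is one plausible way to keep generator identity separable without letting it interfere with the natural/synthetic split; the actual weighting and loss form in FiSeR may differ.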

2606.00605 2026-06-02 cs.LG stat.ML

Looped Transformers with Layer Normalization Provably Learn the Power Method

Lyumin Wu, Chenyang Zhang, Yuan Cao

Affiliations: School of Computing & Data Science, The University of Hong Kong

AI summary: Using principal component prediction as the task, the paper proves that a looped linear transformer with layer normalization, trained by gradient descent, converges to a solution that implements the power method, revealing an algorithmic implicit bias brought about by layer normalization.

Comments: 70 pages, 8 figures

Abstract

Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an "algorithmic implicit bias" of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.
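For readers unfamiliar with the power method referenced above, here is a minimal sketch of the classical algorithm. It is not the paper's transformer construction; the renormalization step is only loosely analogous to the role the abstract attributes to LN. The function names and the example matrix are illustrative.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_method(matrix, num_iters=50):
    """Estimate the leading eigenvector of a symmetric positive semidefinite matrix."""
    n = len(matrix)
    # Start from the first basis vector; this assumes it is not orthogonal
    # to the leading eigenvector.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(num_iters):
        # One power iteration: multiply by the matrix, then renormalize.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = normalize(w)
    return v


# Toy covariance matrix with eigenvalues 3 and 1; the leading eigenvector
# is proportional to (1, 1).
cov = [[2.0, 1.0], [1.0, 2.0]]
top = power_method(cov)
```

The error in the iterate shrinks by roughly the ratio of the second to the first eigenvalue per iteration, which is why a fixed number of iterations (loops) suffices when the spectral gap is bounded away from zero.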

2606.00602 2026-06-02 cs.CV

ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

Rongsheng Wang, Fenghe Tang, Zihang Jiang, Yingtai Li, Xu Zhang, Haoran Lai, Wenxin Ma, Wei Wei, Zhiyang He, Xiaodong Tao, Rui Yan, Qingsong Yao, Shaohua Kevin Zhou

Affiliations: School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China; Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE) Lab, YRD-RIGHT, USTC Suzhou Institute for Advanced Research; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; Biomedical Basic Research Center (BBRC) of Jiangsu Province; Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC; Anhui IFLYTEK CO., Ltd; School of Medicine, Stanford University; State Key Laboratory of Precision and Intelligent Chemistry, Hefei, Anhui, China

AI summary: Proposes the ASAP framework, which learns transferable and interpretable volumetric representations from chest CT scans and radiology reports through anatomy-aware knowledge injection and semantically-adaptive alignment and fusion, achieving state-of-the-art performance across 15 datasets and 22 downstream tasks.

Comments: MICCAI 2025 extension

Abstract

Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval, and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.

2606.00593 2026-06-02 cs.CL cs.AI

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

Affiliations: State Key Lab of CAD&CG, Zhejiang University; School of Software and Microelectronics, Peking University; School of Software Technology, Zhejiang University

AI summary: Proposes SPADER, a reinforcement learning framework that uses a Step-wise Peer Advantage (SPA) mechanism and a diversity-aware exploration reward to address fine-grained credit assignment and sustained exploration in long-horizon tool use for multi-answer QA; experiments show improved recall and F1 on multiple datasets.

Abstract

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
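One way to read the two mechanisms named in the abstract is sketched here. Everything concrete is an assumption made for illustration: the abstract does not specify how steps are aligned, whether returns are discounted, or how rarity is measured. The sketch uses undiscounted returns-to-go, a group-mean baseline per step index (the group includes the trajectory itself), and an inverse-frequency weight; all function names are hypothetical.

```python
def returns_to_go(rewards):
    """Undiscounted return from each step to the end of the trajectory."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def stepwise_peer_advantage(trajectories):
    """Critic-free advantage: own return-to-go at step t minus the mean
    return-to-go of all sampled trajectories that reach step t."""
    rtg = [returns_to_go(rewards) for rewards in trajectories]
    advantages = []
    for own in rtg:
        adv = []
        for t, g in enumerate(own):
            peers = [other[t] for other in rtg if len(other) > t]
            adv.append(g - sum(peers) / len(peers))
        advantages.append(adv)
    return advantages


def diversity_reward(found_answers, frequency):
    """Upweight rare answers via inverse frequency; repeated answers earn nothing."""
    seen, reward = set(), 0.0
    for answer in found_answers:
        if answer in seen:
            continue  # redundant finding: no additional reward
        seen.add(answer)
        reward += 1.0 / frequency.get(answer, 1)
    return reward
```

Because the baseline is computed per step index rather than per trajectory, a rollout that starts well but stalls can receive positive credit early and negative credit later, which is the kind of fine-grained assignment the abstract motivates.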

2606.00592 2026-06-02 cs.CV

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

Mona Gandhi, KJ Joseph, Srinivasan Parthasarathy, Sayan Nag

Affiliations: Ohio State University; Adobe Research

AI summary: Proposes the PRISM benchmark and a multi-scale evaluation framework that enable interpretable design-quality assessment through principle perturbations and layered analysis.

Journal ref: CVPR 2026 Findings

Abstract

Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.

2606.00588 2026-06-02 cs.CV

Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting

Phuoc-Nguyen Bui, Van-Vi Vo, Duc-Tai Le, Van-Nguyen Pham, Ki-Young Kim, Seung-Young Yu, Hyunseung Choo

Affiliations: Research Convergence Institute; Sungkyunkwan University; Dept. of AI Systems Engineering; Dept. of Ophthalmology; Kyung Hee University Medical Center; Dept. of Electrical and Computer Engineering

AI summary: Proposes ReVA, a framework that fuses baseline and month-1 OCT images with tabular data to forecast the 3- to 24-month visual acuity trajectory of diabetic macular edema patients after anti-VEGF therapy.

Comments: Under review

Abstract

Long-term visual acuity (VA) outcomes after anti-VEGF therapy are central to patient counseling, expectation setting, and follow-up planning in diabetic macular edema (DME). However, in clinical practice, physicians must often estimate long-term visual trajectories based only on early post-treatment findings, making reliable prognostication difficult. Although prior OCT-based learning approaches have largely focused on short-term response or single-endpoint prediction, modeling VA trajectories across multiple future time points from early longitudinal observations remains insufficiently explored. In this study, we assembled a real-world cohort of 188 anti-VEGF-treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that integrates structural features from baseline and month-1 OCT with the tabular variables to capture baseline disease status and early treatment response. ReVA uses spatial attention to preserve localized prognostic imaging features and a dependency-aware tabular encoder to model interactions among clinical variables. These multimodal representations are fused to predict patient-specific long-term visual acuity trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management.
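For reference, the three reported metrics follow their standard definitions, sketched here in plain Python. The function name and the sample values used in the comment are made up for illustration and are not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                    # undefined if y_true is constant
    return mae, rmse, r2


# Hypothetical values, chosen only to exercise the function.
mae, rmse, r2 = regression_metrics([0.2, 0.4, 0.6, 0.8], [0.3, 0.4, 0.5, 0.8])
```

RMSE is never smaller than MAE, and the gap between them grows with the spread of the individual errors, so reporting both (as the abstract does) gives a rough sense of how heavy-tailed the errors are.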

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM

Improving Visual Representation Alignment Generation with GRPO

Shentong Mo, Sukmin Yun

Affiliations: Carnegie Mellon University; Hanyang University

AI summary: Proposes VRPO, which uses reinforcement learning to replace the static alignment loss with a generative representation policy optimization objective that dynamically balances representation consistency and generation quality, achieving faster convergence and higher image fidelity in diffusion transformers.

Abstract

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.

2606.00582 2026-06-02 cs.AI

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

Zongzong Wu, Ming Zhao, Fengxiao Tang, Nei Kato

Affiliations: National Natural Science Foundation of China; High Performance Computing Center of Central South University

AI summary: Proposes PropLLM, the first method to combine the hop-by-hop scene reconstruction paradigm with the generative reasoning of LLMs. Using a dual-layer knowledge graph and a temporal causal propagation attention mechanism, it traces back from end-point alerts to localize the root cause and determine the fault type. On a real-world Wi-Fi multimodal fault dataset, it improves diagnosis accuracy by 3.9% and root-cause localization accuracy by 4.7%, and reduces the hallucination rate by 50.8%.

Abstract

Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9% and root cause localization accuracy by 4.7% over the strongest baseline, while reducing the hallucination rate by 50.8%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.

2606.00579 2026-06-02 cs.CL cs.CV

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou

Affiliations: University of Maryland; MBZUAI

AI summary: Shows that sandboxed coding agents with only text+image access and tool use can match or even surpass native omnimodal models on omnimodal tasks, and that skill injection and the Code-X training recipe further improve performance.

Comments: Paper under review

Abstract

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.

2606.00576 2026-06-02 cs.RO

Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation

Zhijie Yan, Shufei Li, Ze Zhang, Xin Liu, Yuhang Zheng, Zuoxu Wang

Affiliations: School of Mechanical Engineering and Automation, Beihang University; Department of Systems Engineering, City University of Hong Kong; School of Computing, National University of Singapore

AI summary: Proposes DREAM, a framework that enables mobile manipulation in dynamic indoor environments without a pre-built map through an online spatio-semantic voxel memory, redundancy-aware memory pruning, and hybrid localization, raising long-horizon task success rates to 55%-70%.

Comments: Code, CAD model, and real-robot demonstrations are available at https://bjhyzj.github.io/dream-web

Abstract

Reliable mobile manipulation in dynamic indoor environments requires a scene representation that remains geometrically consistent, semantically queryable, and computationally bounded as the environment changes. Existing systems often rely on pre-built maps, static-scene assumptions, or highly accurate camera poses, which can lead to stale or misaligned scene information when target objects are relocated or pose estimates are corrected. This paper presents DREAM, a real-robot mobile manipulation framework that integrates perception, memory, localization, navigation, and manipulation in previously unseen indoor environments without a pre-built map. DREAM constructs an online spatio-semantic voxel memory from RGB-D observations registered by a LiDAR-inertial-visual SLAM backend. It further introduces pose-graph-aware Redundancy-Aware Memory Pruning (RMP) to update historical observations after pose corrections while keeping long-horizon observation history bounded. For target localization and reacquisition, DREAM combines language-conditioned 3D retrieval, open-vocabulary image detection, and multimodal large language model based semantic verification. Real-robot experiments in four dynamic indoor laboratory scenes show that DREAM improves long-horizon task success rates from 40%-60% with DynaMem to 55%-70%, while maintaining a memory footprint of 0.37-0.63 GB and an online memory-update time of 0.43-0.53 s across scenes.

2606.00573 2026-06-02 cs.LG

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

Haiyu Wang, Yutong Wang, Leshu Li, Yihui Ren, Sai Qian Zhang

Affiliations: Tandon School of Engineering, New York University; Courant Institute of Mathematical Sciences, New York University; Brookhaven National Laboratory

AI summary: Proposes LASER, a framework that combines loss-aware singular value decomposition and cross-layer rank allocation with a hybrid quantization scheme to compress and accelerate vision-language models under low-precision inference.

Abstract

Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has emerged as a promising compression technique, yet existing methods often optimize local matrix reconstruction error, rely on uniform or heuristic rank allocation, and focus mainly on attention projections while leaving feed-forward networks underexplored. In this paper, we propose LASER (Loss-Aware Singular-value dEcomposition and Rank allocation), a low-rank compression framework for efficient low-precision VLM inference. LASER derives a curvature-weighted SVD objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide decomposition toward downstream performance rather than reconstruction alone. We further introduce a loss-aware cross-layer rank allocation strategy based on calibration gradients, enabling more effective parameter budgeting across layers. Finally, we extend low-rank compression to FFN layers through a hybrid scheme that combines SVD with quantization. The evaluation results show that LASER achieves more than 2.3x decoding speedup over previous work while preserving strong accuracy under low-precision inference.
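To make the idea of cross-layer rank allocation under a parameter budget concrete, here is a rough greedy sketch. It is not LASER's procedure: the abstract says the allocation is based on calibration gradients but gives no formula, so the per-component importance scores below are simply taken as given inputs. It relies only on the standard fact that a rank-r factorization of an m x n matrix stores r(m + n) parameters; `allocate_ranks` and the layer names are hypothetical.

```python
import heapq


def allocate_ranks(layers, param_budget):
    """Greedy rank allocation across layers.

    layers: name -> (rows, cols, scores), where scores[k] is a hypothetical
    importance of the k-th retained component, sorted in descending order.
    Each extra rank in a rows x cols layer costs rows + cols parameters.
    """
    ranks = {name: 0 for name in layers}
    heap = []
    for name, (rows, cols, scores) in layers.items():
        if scores:
            # Max-heap on importance per parameter (negated for heapq's min-heap).
            heapq.heappush(heap, (-scores[0] / (rows + cols), name))
    remaining = param_budget
    while heap:
        _, name = heapq.heappop(heap)
        rows, cols, scores = layers[name]
        cost = rows + cols
        if cost > remaining:
            continue  # this layer cannot afford another rank; try the others
        ranks[name] += 1
        remaining -= cost
        if ranks[name] < len(scores):
            heapq.heappush(heap, (-scores[ranks[name]] / cost, name))
    return ranks


# Toy example with made-up scores: a 4x4 attention projection and a 4x16 FFN matrix.
toy_layers = {
    "attn": (4, 4, [8.0, 4.8, 1.0, 0.5]),
    "ffn": (4, 16, [10.0, 5.0, 0.2, 0.1]),
}
toy_ranks = allocate_ranks(toy_layers, param_budget=36)
```

Ranking components by importance per parameter, rather than by raw importance, is what lets a small layer win ranks over a large one whose components are individually more important but more expensive to keep.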

2606.00572 2026-06-02 cs.LG

Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction

Oluwaleke Yusuf, Adil Rasheed, Frank Lindseth

Affiliations: Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU); Department of Computer Science, Norwegian University of Science and Technology (NTNU)

AI summary: Proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that models trip-level transit prediction as a sequence-to-sequence problem; with graph embeddings, a context encoder, and a multi-gate mixture-of-experts module, it outperforms stop-level baselines on bus transit data from Trondheim, Norway.

Comments: 25 pages, 7 figures, 11 tables, including appendix. Code available at https://github.com/Outsiders17711/SMTGraphFormer

Abstract

Passenger count data from public transit systems reveals urban mobility patterns and is essential for planning, operation, and optimisation. However, non-linear spatiotemporal interdependencies across stops and lines make modelling and prediction challenging. Existing approaches often rely on fixed temporal, spatial, or stop-level formulations, limiting their ability to capture within-trip evolution and network context. This study proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that frames trip-level transit prediction as sequence-to-sequence modelling. Given a line's stop sequence and trip-level context, the model predicts successive boarding and alighting counts, with delay and dwell time treated as encoder-side surrogate tasks. Key components include graph embeddings for multi-relational stop similarity, a context encoder for weather and temporal information, and a multi-gate mixture-of-experts module that produces task-specific decoder representations for boarding and alighting predictions. Evaluation on public bus transit data from Trondheim, Norway, shows that SMT-GraphFormer outperforms stop-level tabular benchmarks, with ablation studies examining each component's contribution. The sequential formulation yields substantial gains on alighting prediction (+0.24 in R^2) and consistent improvements on boarding, delay, and dwell, confirming the value of explicit trip-level sequential bias and inter-target dependencies. These findings demonstrate the potential of transformer-based sequence modelling for capturing complex spatiotemporal dynamics in public transit and underscore the value of architectures tailored to transit data rather than off-the-shelf tabular models. The proposed framework provides a horizon-agnostic basis for scenario analysis in digital twin environments, supporting informed decision-making by planners and transit operators.
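The multi-gate mixture-of-experts idea mentioned in the abstract can be illustrated with a small sketch: all tasks share the same expert outputs, but each task mixes them with its own softmax gate. This is a generic illustration, not the SMT-GraphFormer module. In a trained model the gate logits would come from learned layers applied to the input; here they are passed in directly, and the names `mmoe_mix` and `gate_logits_per_task` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mmoe_mix(expert_outputs, gate_logits_per_task):
    """Combine shared expert outputs with one softmax gate per task.

    expert_outputs: list of equal-length vectors, one per expert.
    gate_logits_per_task: task name -> list of logits, one per expert.
    """
    dim = len(expert_outputs[0])
    mixed = {}
    for task, logits in gate_logits_per_task.items():
        weights = softmax(logits)
        mixed[task] = [
            sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)
        ]
    return mixed


# Two toy experts; "boarding" weights them equally, "alighting" strongly prefers the first.
representations = mmoe_mix(
    [[1.0, 0.0], [0.0, 1.0]],
    {"boarding": [0.0, 0.0], "alighting": [100.0, -100.0]},
)
```

Separate gates let related targets (such as boarding and alighting) share experts where that helps while still diverging where their dynamics differ.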

2606.00571 2026-06-02 cs.LG cs.AI cs.CV

On the Difficulty of Learning a Meta-network for Training Data Selection

学习用于训练数据选择的元网络的困难性

Zilin Du, Junqi Zhao, Boyang Albert Li

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结针对元学习训练数据选择（MTS）在实践中表现不佳的问题，本文通过数学分析揭示了梯度信噪比低和缺乏信息特征两大障碍，并提出增大批大小和利用信息特征作为解决方案。

AI中文摘要

合成数据越来越多地被用于训练神经网络，但若不加区分地使用，其与真实数据的分布不匹配会限制其有效性。一种常见策略是通过双层优化学习数据权重，我们称之为元学习训练数据选择（MTS）。有趣的是，在实践中，MTS 往往低于预期。我们识别了正确训练 MTS 的两个障碍：梯度信噪比（GSNR）低导致优化困难，以及缺乏与数据质量相关的信息特征。我们对 MTS 进行了数学分析，揭示了归一化数据权重的动态以及不同数据质量与低 GSNR 之间的关系。分析表明，一个简单而有效的解决方案是增大批大小。此外，我们提出了一组信息特征，用于捕捉训练数据在其分布中的位置和训练动态。在四个基准上的实验显示了一致的改进，与无选择的训练相比平均提升 5.49%，与最强基线相比平均提升 2.89%。

英文摘要

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and a lack of informative features that correlate with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.
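The connection between batch size and GSNR can be illustrated numerically. The sketch below is not the paper's analysis: it assumes a common definition of GSNR (squared mean of the mini-batch gradient divided by its variance), a scalar gradient, and made-up per-sample gradient values.

```python
import random
import statistics


def minibatch_gsnr(per_sample_grads, batch_size, num_batches=2000, seed=0):
    """Estimate GSNR = mean^2 / variance of the mini-batch gradient.

    Assumptions (not from the paper): a scalar gradient, batches drawn
    with replacement, and this particular definition of GSNR.
    """
    rng = random.Random(seed)
    batch_grads = [
        statistics.fmean(rng.choices(per_sample_grads, k=batch_size))
        for _ in range(num_batches)
    ]
    mean = statistics.fmean(batch_grads)
    variance = statistics.pvariance(batch_grads)
    return mean * mean / variance


# Hypothetical per-sample gradients: weak signal (mean 0.1), strong noise.
grads = [1.1, -0.9] * 500

gsnr_small = minibatch_gsnr(grads, batch_size=4)
gsnr_large = minibatch_gsnr(grads, batch_size=64)
print(f"batch 4: {gsnr_small:.3f}, batch 64: {gsnr_large:.3f}")
```

Because averaging over a batch of size B divides the gradient variance by roughly B while leaving the mean unchanged, the estimated GSNR grows roughly in proportion to B in this toy setting.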

2606.00570 2026-06-02 cs.CL cs.AI

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

重新审视大型语言模型中基于参数的知识编辑：理论极限与实证证据

Wanying Ren, Xin Song, Futing Wang, Guoxiu He, Aixin Sun

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文基于维度坍缩假设进行理论分析，并结合实证评估，揭示了基于参数的知识编辑方法会引发全局干扰和推理崩溃，而简单的基于检索的基线在所有评估条件下均表现更优。

Comments: Accepted to ICML 2026. Equal contribution by the first two authors. 9 pages main paper, 10 figures, with appendix

AI中文摘要

基于参数的知识编辑通过局部权重修改更新大型语言模型（LLMs）的内部知识，并引起了广泛关注。然而，大多数现有方法忽略了基本的理论限制，并且很少在现实的、面向实践的设置下进行评估。在本文中，我们首先基于维度坍缩假设提出理论分析，解释局部参数编辑如何沿着表示空间中的脆弱方向传播，引发全局干扰并最终导致推理崩溃。基于这一见解，我们通过系统变化知识复杂度、编辑次数、评估维度和基线方法进行了全面的实证评估。我们的结果表明，基于参数的编辑方法持续损害LLM的核心能力。相比之下，一个简单的基于检索的基线在所有评估条件下始终比所有参数编辑方法表现更强。这些发现强调，在知识编辑后保持LLM的基本能力应成为未来研究的核心关注点。

英文摘要

Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the Dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.

2606.00566 2026-06-02 cs.LG cs.CL cs.CR

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

相同载荷，不同通道：测量使用工具的语言模型中的信任不对称性

Mohammed Sameer Syed, Rozhin Yasaei

发表机构 * University of Arizona（亚利桑那大学）

AI总结本研究提出安全不对称分数（SAS），通过匹配恶意载荷仅改变传递上下文，系统测量了语言模型在不同通道（用户消息、工具元数据、工具输出）中对对抗性内容的脆弱性差异，发现代理原生模型在工具描述通道更脆弱，而通用模型相反，且机制研究表明安全相关表示在深层网络非线性编码。

Comments: 13 pages, 1 figure. Submitted to EMNLP 2026

AI中文摘要

随着语言模型承担代理角色，包括调用外部API、读取工具输出以及执行嵌入在第三方内容中的指令，其攻击面远超用户输入。模型是否以相同方式处理恶意指令（无论其来源）尚未被系统研究。我们引入了安全不对称分数（SAS），通过使用匹配的载荷对（保持恶意文本相同，仅改变传递上下文）来测量模型对对抗性内容的敏感性如何随内容出现在用户消息、工具元数据或工具输出中而变化。在6个生产级LLM和三种攻击家族上的评估发现了一致且信息丰富的不对称性：当对抗性内容通过工具描述而非用户消息传递时，代理原生模型显著更脆弱，而通用模型则相反。当相同内容通过工具输出而非描述传递时，这种不对称性进一步反转，表明模型隐含地将工具元数据视为可信指令，而将工具结果视为普通数据。对Llama 3.3 70B的机制研究表明，安全相关表示在网络的中间到深层因果存在但非线性编码，解释了线性探针为何无法检测到它。这些发现揭示了当前使用工具的模型在处理对抗性内容时存在的系统性、通道依赖的盲点。

英文摘要

As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model treats a malicious instruction the same way regardless of where it arrives has not been systematically studied. We introduce the Safety Asymmetry Score (SAS), which measures how much a model's susceptibility to adversarial content shifts depending on whether that content arrives in the user message, tool metadata, or tool output, using matched payload pairs that keep the malicious text identical and vary only the context of delivery. Evaluating six production LLMs across three attack families, we find a consistent and informative asymmetry: agent-native models are substantially more vulnerable when adversarial content arrives via tool descriptions than via user messages, while general-purpose models show the reverse. This asymmetry further inverts when the same content is delivered through tool outputs rather than descriptions, suggesting models implicitly treat tool metadata as trusted instructions and tool results as ordinary data. A mechanistic study on Llama 3.3 70B reveals that the safety-relevant representation is causally present at mid-to-late network depths but non-linearly encoded, explaining why linear probes fail to detect it. These findings expose a systematic, channel-dependent blind spot in how current tool-using models handle adversarial content.
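The abstract does not give a formula for SAS. As a rough illustration of the matched-payload idea only, the sketch below assumes the asymmetry is the difference in attack success rate between two delivery channels over the same payloads. The channel names, the outcome data, and the function names are all hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of payloads for which the attack succeeded."""
    return sum(outcomes) / len(outcomes)


def channel_asymmetry(results, channel_a, channel_b):
    """Difference in attack success rate between two delivery channels.

    `results` maps a channel name to a list of booleans, one per matched
    payload, in the same payload order for every channel.
    Assumption: this difference-of-rates form is illustrative, not the
    paper's definition of SAS.
    """
    a, b = results[channel_a], results[channel_b]
    if len(a) != len(b):
        raise ValueError("matched payloads require equal-length outcome lists")
    return attack_success_rate(a) - attack_success_rate(b)


# Hypothetical outcomes for four matched payloads (True = attack succeeded).
results = {
    "user_message": [False, False, True, False],
    "tool_description": [True, True, True, False],
}
score = channel_asymmetry(results, "tool_description", "user_message")
print(score)
```

A positive value here means the model was more susceptible when the identical text arrived through the tool description than through the user message.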

2606.00564 2026-06-02 cs.CV cs.CL

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

面向视觉-语言推理的分解式在策略蒸馏：引导梯度实现视觉定位

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结通过将视觉-语言模型蒸馏损失分解为语言先验和视觉定位两个正交分量，提出视觉梯度引导（VGS）方法动态调整更新方向以优先优化视觉子空间，从而提升小模型在复杂多模态任务中的定位能力。

Comments: ICML 2026 Spotlight

AI中文摘要

虽然在策略蒸馏为训练小型推理模型提供了密集监督，但其在多模态领域的优化动态仍未得到充分探索。在这项工作中，我们通过数学上将损失分解为两个不同的组成部分：语言先验和视觉定位，挑战了视觉-语言模型（VLM）蒸馏的标准整体观点。我们的分析揭示，这些分量的梯度向量几乎正交，表明与教师语言分布对齐的目标在几何上独立于匹配其视觉感知的目标。因此，标准优化被动地遵循一条次优的折衷轨迹，隐式地平衡这两个目标。假设视觉定位是视觉-语言推理的主要瓶颈，我们引入了视觉梯度引导（VGS），一种动态重新定向更新向量以优先考虑视觉子空间的方法。在多个蒸馏设置和复杂多模态基准上的实验结果表明，VGS显著优于标准的在策略蒸馏整体公式，以最小的训练开销实现了卓越的定位能力。

英文摘要

While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
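To make the geometric claim concrete, the sketch below measures the cosine similarity between two gradient components and then forms an update that leans toward the visual one. The weighted-sum steering rule, the `visual_weight` parameter, and the toy gradient vectors are assumptions for illustration; the abstract does not specify how VGS reorients the update.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def steer(g_lang, g_vis, visual_weight=0.8):
    """Hypothetical steering rule: a convex combination favoring g_vis.

    visual_weight = 0.5 recovers the direction of the plain sum
    g_lang + g_vis, which is the standard monolithic update direction.
    """
    return [
        (1.0 - visual_weight) * gl + visual_weight * gv
        for gl, gv in zip(g_lang, g_vis)
    ]


# Toy, exactly orthogonal components (made-up values).
g_lang = [1.0, 0.0]
g_vis = [0.0, 1.0]

standard = [gl + gv for gl, gv in zip(g_lang, g_vis)]
steered = steer(g_lang, g_vis)

print(cosine(g_lang, g_vis))    # orthogonal components
print(cosine(standard, g_vis))  # compromise direction
print(cosine(steered, g_vis))   # closer to the visual direction
```

When the two components are nearly orthogonal, the plain sum sits halfway between them, so any reweighting toward the visual component raises alignment with it without opposing the language component.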

2606.00563 2026-06-02 cs.LG cs.AI stat.ML

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models

医学预测模型中选择偏差影响的一个实用上界

Kara Liu, Maggie Wang, Russ B. Altman

发表机构 * Stanford University（斯坦福大学）

AI总结针对选择偏差导致模型泛化性差的问题，提出在仅部分观测选择机制和目标分布的现实条件下，对目标群体最差模型性能的一个新上界，并通过合成数据和真实数据验证其有效性和实用性。

DOI: 10.1145/3770855.3818112
Comments: 32 pages, 27 figures, will be published at ACM SIGKDD '26

AI中文摘要

选择偏差是真实世界数据中常见且往往不可避免的一个方面，它挑战了机器学习模型的泛化性。当在偏倚数据上训练的模型被部署到更广泛的目标群体时，模型泛化能力差可能导致实际危害，尤其是在医疗保健等高危环境中。这种风险凸显了从业者在部署前可靠评估模型泛化性的需求。然而，现有的预测模型性能的方法依赖于不切实际地访问目标分布或了解导致偏差的选择机制。为了解决这些局限性，我们提出了一个新颖的上界，用于在现实设置下目标群体上的最差模型性能，其中选择机制和目标群体数据仅被部分观测。我们通过在完全合成数据、源自All of Us研究计划的半合成数据以及MIMIC-IV中的真实世界选择偏差上的实验，证明了我们方法的有效性和实际效用。我们的工作提供了一个原则性和实用性的工具，用于估计在原本难以处理的情况下选择偏差的影响，从而使从业者能够在医疗保健及其他领域构建更安全、更具泛化性的模型。

英文摘要

Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.
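For intuition about what a worst-case bound under selection looks like, the sketch below implements a generic bound that follows from the law of total expectation for a loss bounded in [0, max_loss]. It is not the bound proposed in the paper, and it assumes the selection rate (the share of the target population represented in the observed data) is known.

```python
def worst_case_target_error(observed_error, selection_rate, max_loss=1.0):
    """Generic worst-case error on the full target population.

    E[loss] = P(selected) * E[loss | selected]
            + P(not selected) * E[loss | not selected]
    and the last term is at most max_loss, which gives the bound below.
    Assumption: illustrative only; not the paper's bound.
    """
    if not 0.0 <= selection_rate <= 1.0:
        raise ValueError("selection_rate must lie in [0, 1]")
    if not 0.0 <= observed_error <= max_loss:
        raise ValueError("observed_error must lie in [0, max_loss]")
    return selection_rate * observed_error + (1.0 - selection_rate) * max_loss


# Hypothetical numbers: 10% error on observed data, half the population unobserved.
bound = worst_case_target_error(observed_error=0.1, selection_rate=0.5)
print(bound)
```

The bound becomes vacuous quickly as the selection rate falls, which is one reason tighter bounds that exploit partial knowledge of the selection mechanism are of practical interest.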

2606.00562 2026-06-02 cs.CV cs.LG

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

DeepLatent：通过并行潜在视觉推理用图像思考

Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

发表机构 * Baidu Inc.（百度公司）； Peking University（北京大学）

AI总结提出DeepLatent框架，通过LatentFormer并行生成潜在视觉状态，并结合连续空间强化学习优化潜在表示，在多个基准上达到最先进性能。

AI中文摘要

“用图像思考”的新兴范式将视觉状态嵌入中间推理步骤，定义了视觉语言模型的新前沿。现有方法沿两条路线分化。工具辅助方法应用显式视觉操作，但存在高延迟和操作类型受限的问题。潜在推理方法自回归生成隐式视觉状态，但性能不如工具辅助方法，且其潜在标记无法捕获有效的视觉信息。在这项工作中，我们提出DeepLatent，一个用于潜在视觉推理的并行框架。首先，我们引入LatentFormer。它使用可学习的2D标记并行生成上下文条件的潜在状态，将每次视觉更新直接锚定在原始图像特征中。其次，我们设计了一种连续空间强化学习算法。它直接在嵌入空间中优化潜在调制参数，显著提高潜在表示质量。该框架先通过知识蒸馏训练，再使用上述连续空间强化学习算法进行训练。此外，我们贡献了DeepLatent-180K，一个专为潜在视觉推理定制的大规模数据集。在多个基准上的广泛评估表明，DeepLatent达到了最先进的性能。

英文摘要

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.

2606.00561 2026-06-02 cs.LG cs.AI

Interpretable Policy Distillation for Power Grid Topology Control

可解释的策略蒸馏用于电网拓扑控制

Aleksandra Dmitruka, Karlis Freivalds

发表机构 * University of Latvia, Faculty of Exact Sciences and Technology（拉脱维亚大学，精确科学与技术学院）

AI总结提出一种将深度强化学习策略蒸馏为轻量级决策树/随机森林的方法，在保持性能的同时提升可解释性，并揭示表征偏移。

AI中文摘要

深度强化学习为实时电网运行提供了有前景的途径，但大型神经策略评估成本高、难以在受限硬件上部署，且对操作员不透明。我们探究用于电网拓扑控制的近端策略优化（PPO）智能体能否压缩为紧凑的树基替代模型而不损失运行性能。在Grid2Op的标准14节点环境中，使用面向稳定性的奖励，通过压力聚焦的数据收集在关键高负荷状态下训练PPO教师。然后将策略蒸馏为决策树和随机森林。在保留的验证回合中，两个替代模型在平均奖励和生存时长上均超过教师，而推理成本仅为教师的一小部分。决策树与PPO argmax的动作完全一致率较高，且在其排名靠前的动作中几乎完全一致，同时保持足够小以便直接检查。特征重要性分析揭示了表征偏移：PPO策略主要依赖线路负载信号，而蒸馏树主要由母线拓扑变量驱动。这些结果表明，压力聚焦的蒸馏可以将黑箱神经控制器转换为轻量级、可审计的规则类替代模型，适用于实时部署，同时揭示与确定性动作和拓扑特定泛化相关的风险。

英文摘要

Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evaluate, hard to deploy on constrained hardware, and opaque to operators. We ask whether a Proximal Policy Optimization (PPO) agent for grid topology control can be compressed into compact tree-based surrogates without losing operational performance. A PPO teacher is trained on Grid2Op's standard 14-bus environment with a stability-oriented reward, using stress-focused data collection on critical, high-loading states. The policy is then distilled into a decision tree and a random forest. Across held-out validation episodes, both surrogates exceed the teacher in mean reward and survival length at a fraction of the inference cost. The decision tree shows high exact-action agreement with the PPO argmax and near-complete agreement within its top-ranked actions, while remaining small enough to be inspected directly. Feature-importance analysis reveals a representational shift: the PPO policy relies mainly on line-loading signals, while the distilled tree is driven primarily by bus-topology variables. These results suggest that stress-focused distillation can convert a black-box neural controller into a lightweight, auditable rule-like surrogate suited for real-time deployment, while also surfacing risks tied to deterministic actions and topology-specific generalization.
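The two agreement measures mentioned above can be sketched as follows. The interpretation of "agreement within its top-ranked actions" as the teacher's argmax appearing among the student's top-k actions is an assumption, and the action IDs and rankings are made up.

```python
def exact_agreement(teacher_actions, student_actions):
    """Fraction of states where the student picks the teacher's argmax action."""
    matches = sum(t == s for t, s in zip(teacher_actions, student_actions))
    return matches / len(teacher_actions)


def top_k_agreement(teacher_actions, student_rankings, k):
    """Fraction of states where the teacher's action is in the student's top k.

    Assumption: `student_rankings` holds, per state, the student's actions
    sorted from most to least preferred (e.g. by predicted class probability).
    """
    hits = sum(
        t in ranking[:k] for t, ranking in zip(teacher_actions, student_rankings)
    )
    return hits / len(teacher_actions)


# Hypothetical action IDs for four states.
teacher = [3, 0, 7, 3]
rankings = [[3, 1, 0], [2, 0, 5], [7, 3, 1], [1, 3, 0]]
student_top1 = [ranking[0] for ranking in rankings]

print(exact_agreement(teacher, student_top1))
print(top_k_agreement(teacher, rankings, k=2))
```

Top-k agreement is always at least as high as exact agreement, so a large gap between the two indicates that the surrogate usually ranks the teacher's action highly even when it does not pick it first.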

2606.00559 2026-06-02 cs.LG cs.AI

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

通过辅助重建实现神经算法推理的更丰富表示

Jiafu Huang, Chao Peng, Chenyang Xu, Zhengfeng Yang, Kecheng Cai, Chenhao Zhang, Yi Wang, Yiwei Gong, Wanqin Zhou, Irene Zheng

发表机构 * sei.ecnu.edu.cn（华东师范大学软件工程学院）

AI总结提出辅助重建模块和自监督学习变体，增强编码器对输入状态信息的保留和特征间依赖的捕捉，从而提升神经算法推理性能。

Comments: Appeared at AAAI 2026

AI中文摘要

神经算法推理已成为一个热门研究方向。它旨在训练神经网络模仿经典基于规则的算法的逐步行为。更具体地说，此类算法的执行可以抽象为一系列状态，其中每个状态代表执行步骤后的中间结果。训练目标是生成复制底层算法过程的状态序列。该任务的常见框架采用编码器-处理器-解码器架构，其中编码器学习状态的表示，处理器模拟算法步骤，解码器重建输出状态。虽然先前的工作侧重于改进处理器，但编码器在表示学习中的作用很少受到关注。大多数方法依赖简单的MLP编码器，这引发了一个问题：这些表示是否足够信息丰富以支持算法推理。本文研究如何改进神经算法推理的编码器表示。我们提出一个重建模块，旨在从其编码表示中恢复输入状态。这个辅助重建任务鼓励编码器保留关于输入的关键信息。我们证明，在训练过程中加入此任务可以提高现有神经架构在标准基准上的性能。此外，我们观察到当前编码器常常未充分利用状态内特征之间的相关性。为了解决这个问题，我们从自监督学习中汲取灵感，设计了一个增强的辅助任务变体，鼓励编码器捕捉状态内特征依赖。实验结果表明，我们的方法使编码器能够学习更丰富的表示，从而增强现有处理器在算法推理任务上的性能。

英文摘要

Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavior of classical rule-based algorithms. More specifically, the execution of such algorithms can be abstracted as a sequence of states, where each state represents the intermediate outcome after an execution step. The training objective is to generate state sequences that replicate the underlying algorithmic process. A common framework for this task adopts an encoder-processor-decoder architecture, where the encoder learns representations of states, the processor simulates algorithmic steps, and the decoder reconstructs output states. While prior work has focused on improving the processor, the role of the encoder in representation learning has received little attention. Most methods rely on simple MLP encoders, raising the question of whether such representations are sufficiently informative for supporting algorithmic reasoning. This paper investigates how to improve encoder representations for neural algorithmic reasoning. We propose a reconstruction module that aims to recover the input state from its encoded representation. This auxiliary reconstruction task encourages the encoder to retain critical information about the input. We demonstrate that incorporating this task during training improves the performance of existing neural architectures on standard benchmarks. Furthermore, we observe that current encoders often underutilize the correlations among features within a state. To address this, we draw inspiration from self-supervised learning and design an enhanced variant of the auxiliary task that encourages the encoder to capture intra-state feature dependencies. Experimental results show that our method enables the encoder to learn richer representations, thereby enhancing the performance of existing processors on algorithmic reasoning tasks.
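The auxiliary objective described above can be written as a weighted sum of the main task loss and a reconstruction loss. The sketch below assumes a mean-squared-error reconstruction term and a hypothetical weight `recon_weight`; the abstract does not state the loss form or weighting actually used.

```python
def mse(target, prediction):
    """Mean squared error between two equally long feature vectors."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)


def total_loss(task_loss, input_state, reconstructed_state, recon_weight=0.5):
    """Task loss plus a weighted auxiliary reconstruction loss.

    `reconstructed_state` stands for the output of a reconstruction module
    applied to the encoder's representation of `input_state`.
    Assumption: MSE and the 0.5 weight are illustrative choices.
    """
    return task_loss + recon_weight * mse(input_state, reconstructed_state)


# Made-up values: the second feature is reconstructed poorly.
loss = total_loss(
    task_loss=1.0,
    input_state=[1.0, 2.0],
    reconstructed_state=[1.0, 4.0],
)
print(loss)
```

The reconstruction term penalizes encodings from which the input state cannot be recovered, which is the mechanism the abstract credits for retaining critical input information.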

2606.00558 2026-06-02 cs.LG

Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain

半监督噪声适应：从噪声域迁移知识

Yuan Yao, Jin Song, Huixia Li, Tongtong Yuan, Jiaqi Wu, Yu Zhang

AI总结提出半监督噪声适应（SSNA）问题，利用合成噪声域作为源域，通过噪声适应框架（NAF）改善目标域的泛化性能。

Comments: Accepted by ICML 2026

AI中文摘要

迁移学习旨在通过从源域迁移知识来促进目标域的学习。源域通常包含语义上有意义的样本（例如图像），以促进有效的知识迁移。然而，最近的一项研究观察到，由简单分布（例如高斯分布）构建的噪声域可以在半监督设置中作为替代源域，其中只有一小部分目标样本被标记，而大多数样本未标记。基于这一令人惊讶的观察，我们提出了一种称为半监督噪声适应（SSNA）的新问题，旨在利用合成噪声域来提高目标域的泛化能力。为了解决这个问题，我们首先建立了一个泛化界，描述了噪声域对泛化的影响，基于此我们提出了噪声适应框架（NAF）。大量实验表明，NAF有效地利用噪声域来收紧目标域的泛化界，从而提高了性能。代码可在 https://github.com/AIResearch-Group/SSNA 获取。

英文摘要

Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful samples (*e.g.*, images) to facilitate effective knowledge transfer. However, a recent study observes that the noise domain constructed from simple distributions (*e.g.*, Gaussian distributions) can serve as a surrogate source domain in the semi-supervised setting, where only a small proportion of target samples are labeled while most remain unlabeled. Based on this surprising observation, we formulate a novel problem termed *Semi-Supervised Noise Adaptation* (SSNA), which aims to leverage a synthetic noise domain to improve the generalization of the target domain. To address this problem, we first establish a generalization bound characterizing the effect of the noise domain on generalization, based on which we propose a Noise Adaptation Framework (NAF). Extensive experiments demonstrate that NAF effectively leverages the noise domain to tighten the generalization bound of the target domain, leading to improved performance. The code is available at https://github.com/AIResearch-Group/SSNA.

2606.00556 2026-06-02 cs.CV

Improving Visual Grounding in Remote Sensing via Cluster-Guided Refinement and Model Ensemble Voting

通过聚类引导精炼和模型集成投票改进遥感中的视觉定位

Panav Shah, Geet Sethi, Ashutosh Gandhe

发表机构 * Indian Institute of Technology Bombay（印度理工学院孟买分校）

AI总结提出两种视觉定位流程（SGR和CGR），结合遥感专用模型RemoteSAM和通用分割模型SAM3，并通过多模型集成投票提升定位精度。

Comments: Accepted at CVPR 2026 Workshop MORSE

AI中文摘要

视觉定位旨在定位与自然语言描述对应的图像区域，是可解释视觉系统的关键组成部分。在遥感图像中，由于场景复杂、目标小且尺度变化大，定位尤为困难。依赖单一模型往往不足以应对这些多样化的挑战。在这项工作中，我们提出了两种定位流程，即序列定位精炼（SGR）和聚类感知定位精炼（CGR），它们结合了专门用于遥感的视觉定位模型RemoteSAM和强大的通用分割模型SAM3的互补优势。我们的方法首先使用RemoteSAM获得目标位置的初始估计，然后使用SAM3进行精炼，以产生更准确且空间一致的分割。此外，我们探索了一种基于六个不同能力的定位流程的多数投票集成策略。这种多模型框架提高了鲁棒性，并显著提升了定位精度。实验结果表明，所提出的流程和集成方法优于单个模型，从而产生更可靠和精确的视觉定位预测。

英文摘要

Visual grounding aims to locate image regions that correspond to natural language descriptions and is a key component of interpretable vision systems. In remote sensing imagery, grounding is particularly challenging due to complex scenes, small objects, and large variations in scale. Relying on a single model is often insufficient to address these diverse challenges. In this work, we propose two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), that combine the complementary strengths of RemoteSAM, a visual grounding model specialized for remote sensing, and SAM3, a powerful general-purpose segmentation model. Our approach first uses RemoteSAM to obtain an initial estimate of object location, which is then refined using SAM3 to produce more accurate and spatially consistent segmentations. Additionally, we explore an ensemble strategy based on majority voting across six diverse grounding pipelines, each with distinct capabilities. This multi-model framework improves robustness and significantly enhances localization accuracy. Experimental results demonstrate that the proposed pipelines and ensemble approach outperform individual models, leading to more reliable and precise visual grounding predictions.
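Majority voting across several grounding pipelines can be sketched at the pixel level. The example below assumes each pipeline outputs a binary mask of the same size and that ties (possible with an even number of pipelines, such as six) are resolved as background; the abstract does not say how the ensemble handles ties or whether it votes on masks or boxes.

```python
def majority_vote(masks):
    """Pixel-wise strict-majority vote over equally sized binary masks.

    Each mask is a list of rows, each row a list of 0/1 values.
    Assumption: a pixel is foreground only if more than half of the
    pipelines mark it, so an exact tie yields background.
    """
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[r][c] for m in masks) > n else 0 for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 1x3 masks from three pipelines.
three = [[[1, 1, 0]], [[1, 0, 0]], [[0, 1, 0]]]
print(majority_vote(three))

# Six pipelines with a 3-3 tie on the only pixel.
six = [[[1]], [[1]], [[1]], [[0]], [[0]], [[0]]]
print(majority_vote(six))
```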

2606.00548 2026-06-02 cs.CV cs.AI cs.LG

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

CAFOSat：用于基于高分辨率影像的基础设施感知型CAFO制图的高质量标注数据集

Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga

发表机构 * University of Virginia（弗吉尼亚大学）； Biocomplexity Institute, University of Virginia（弗吉尼亚大学生物复杂性研究所）

AI总结针对集中式动物饲养操作（CAFO）大规模制图困难，提出CAFOSat数据集，集成高分辨率NAIP影像与多源CAFO清单，通过人机协同标注、GradCAM定位和几何聚类优化弱定位记录，并引入合成增强管道，实现基础设施级标注和鲁棒分类。

Comments: Accepted at CVPR Workshop-2026. First two authors have equal contribution

AI中文摘要

集中式动物饲养操作（CAFO）在农业生产中发挥重要作用，但也与环境、公共卫生和疾病监测问题相关。由于基础设施布局异质、位置记录噪声大、标注不一致以及清单不完整，从遥感影像大规模制图CAFO仍具挑战。我们引入CAFOSat，一个用于美国全境CAFO制图的高质量标注、基础设施感知数据集。CAFOSat集成高分辨率国家农业影像计划（NAIP）影像与跨州收集的多源CAFO清单，并通过结合AI辅助标注、基于GradCAM的定位和几何聚类的人机协同管道，将弱地理定位记录转化为精细标注。为提高数据集质量，我们利用土地覆盖引导采样和空间排除约束筛选具有挑战性的负样本，并通过人工验证提供基础设施级标注，包括畜棚、粪池和放牧相关特征。最终数据集包含超过45,000个图像块，覆盖20个州和四大CAFO类别。我们对多种卷积、基于Transformer和视觉-语言模型进行基准测试，证明了精细标注和精心筛选的负样本在CAFO分类和泛化中的价值。此外，我们引入一个合成增强管道，生成基础设施感知的变体以增加训练多样性并提升分布偏移下的鲁棒性。CAFOSat为推进基础设施感知的农业监测和基于高分辨率遥感影像的CAFO制图提供了大规模基准。

英文摘要

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.

2605.01797 2026-06-02 cs.AI

Neural Decision-Propagation for Answer Set Programming

面向回答集编程的神经决策传播

Thomas Eiter, Katsumi Inoue, Sota Moriyama

发表机构 * Vienna University of Technology (TU Wien)（维也纳工业大学，TU Wien）； National Institute of Informatics（日本国立情报学研究所）； The Graduate University for Advanced Studies, SOKENDAI（综合研究大学院大学，SOKENDAI）

AI总结提出决策传播（DProp）方法及其可微扩展神经决策传播（NDProp），通过交替进行假值决策与真值传播来计算稳定模型，提升神经符号推理的可扩展性和准确性。

Comments: This is the full version (with appendix) of a paper appearing at the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026)

AI中文摘要

将回答集编程（ASP）与神经网络集成已成为神经符号AI中一种有前景的工具。虽然现有方法将ASP的能力扩展到现实世界领域，但其推理流程依赖于经典求解器，这成为可扩展性的瓶颈。为解决这一问题，我们提出了一种计算稳定模型的新方法，称为决策传播（DProp），它交替进行假值决策与真值传播。我们证明了成功的DProp计算能够捕捉稳定模型语义。随后，我们开发了神经决策传播（NDProp），它是DProp的可微扩展，使用神经计算进行决策，使用模糊评估进行传播。我们评估了NDProp在学习决策启发式以及神经符号集成方面的能力，并将其与现有的神经符号方法进行了比较。结果表明，NDProp能够学习高效计算稳定模型，并在神经符号基准测试中提高了准确性和可扩展性。

英文摘要

Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in neuro-symbolic AI. While existing approaches extend the capabilities of ASP to real-world domains, their reasoning pipelines depend on classical solvers, which is a bottleneck for scalability. To tackle this problem, we propose a new method to compute stable models, called decision-propagation (DProp), which alternates falsity decisions and truth propagations. Successful DProp computations are shown to capture the stable model semantics. We then develop Neural DProp (NDProp), a differentiable extension of DProp with neural computation for decisions and fuzzy evaluation for propagations. We evaluate the capabilities of NDProp for learning decision heuristics as well as neuro-symbolic integration, and compare it with existing neuro-symbolic approaches. The results show that NDProp can learn to efficiently compute stable models, and it improves accuracy and scalability on neuro-symbolic benchmarks.
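For readers unfamiliar with the stable model semantics that DProp targets, the sketch below checks whether a candidate set of atoms is a stable model of a propositional normal logic program, using the standard Gelfond-Lifschitz reduct. This is the textbook definition, not the DProp procedure itself; the rule encoding as `(head, positive_body, negative_body)` tuples is a convention chosen for this example.

```python
def least_model(definite_rules):
    """Least model of a negation-free program via fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, positive_body in definite_rules:
            if positive_body <= model and head not in model:
                model.add(head)
                changed = True
    return model


def is_stable_model(rules, candidate):
    """Check stability with the Gelfond-Lifschitz reduct.

    Rules whose negative body intersects the candidate are dropped; the
    remaining rules lose their negative literals. The candidate is stable
    iff it equals the least model of this reduct.
    """
    candidate = set(candidate)
    reduct = [
        (head, set(positive_body))
        for head, positive_body, negative_body in rules
        if not (set(negative_body) & candidate)
    ]
    return least_model(reduct) == candidate


# Program:  a :- not b.   b :- not a.
program = [("a", [], ["b"]), ("b", [], ["a"])]

print(is_stable_model(program, {"a"}))
print(is_stable_model(program, {"b"}))
print(is_stable_model(program, {"a", "b"}))
print(is_stable_model(program, set()))
```

Enumerating candidates this way is exponential in the number of atoms, which is why solvers, and approaches such as the one described above, search for stable models through decisions and propagation rather than brute-force checking.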

2606.00547 2026-06-02 cs.CL

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

学习检索：面向文本到SQL代理的双层长期记忆

Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He

发表机构 * University of Illinois Chicago（伊利诺伊大学芝加哥分校）； Snowflake AI Research（Snowflake AI研究院）

AI总结提出MERIT框架，通过强化学习优化双层记忆检索（全局策略与局部决策），提升交互式文本到SQL代理的成功率并减少交互轮次。

AI中文摘要

交互式文本到SQL代理通过多轮交互解决数据库任务，涉及模式探索、查询执行、反馈解释和决策修订。长期记忆帮助代理重用过去经验，但现有检索方法仍有局限。静态方法依赖固定的相似性启发式，无法优化下游效用；动态方法通常从稀疏的最终结果中学习，并在单一决策时域上检索记忆。当记忆的有用性随交互阶段变化时，这种做法是不够的，因为对初始规划有用的记忆可能不同于局部、状态条件执行所需的记忆。我们提出MERIT，一种动态多时域记忆检索框架。MERIT维护用于全局策略指导的回合级（episode-level）记忆和用于局部决策支持的轮次级（turn-level）记忆。两个层级都使用通过强化学习优化的学习型检索策略。为了在中间监督有限的情况下训练轮次级检索，MERIT使用轻量级过程奖励模型为局部记忆选择提供密集的代理奖励。在BIRD-Interact上的实验表明，MERIT在成功率上优于无记忆、静态检索和动态检索基线，同时减少了平均交互轮次。在Spider2-Snow上的迁移结果进一步显示了无需基准特定调优的跨基准正迁移。这些结果表明，多时域检索改善了交互式文本到SQL代理中的经验重用。

英文摘要

Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.

2606.00545 2026-06-02 cs.LG

The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition

助手作为特权角色：跨角色自我识别中的规范参考

Asvin G

发表机构 * Institute for Advanced Study, Princeton（普林斯顿高级研究院）

AI总结本文研究后训练语言模型在跨角色作者身份判断中的表现，发现助手角色作为规范参考，其熵信号和角色向量距离紧密耦合，且这种耦合仅对助手角色成立。

Comments: Project out of Anthropic Fellows

AI中文摘要

后训练语言模型能够仅凭脱离上下文的一两句话识别自己的输出。在配套论文 \citep{jack2026twomodes} 中，我们展示了它们还能通过助手模式生成时的急剧熵降，识别自己当前是否在按策略（on-policy）行动。这两个信号都与后训练主要塑造的助手角色相关。本文将框架扩展到 Llama-3.1-70B-Instruct 上的跨角色作者身份判断。我们测量了一个由评估者和生成者角色（从图书管理员到龙到莎士比亚）组成的面板上的作者身份声称率矩阵，并提出两个主张。首先，在助手自己的矩阵行上，助手的声称率、激活空间中与助手的角色向量距离，以及助手对某个角色文本的惊讶与该角色对自己文本的惊讶之间的熵差，三者紧密耦合。这扩展了配套论文中“行动”的熵特征，使之成为“已行动”的回顾性特征。其次，这种耦合在助手行之外失效：熵差的自然对称扩展不能预测风格鲜明的评估者（海盗、龙、莎士比亚）的作者身份判断；起作用的是非对称的量，即评估者与助手对同一文本的惊讶比较，而非与生成者的比较。我们尝试了许多候选替代角色，结果无一能够扮演这一参考角色，从而排除了任何角色都能充当参考的可能性。我们将这种非对称性解释为模型在执行隐式贝叶斯似然比检验，以助手作为规范备择假设，而 \citet{chen2025persona} 的角色向量几何（每个角色都是助手的一个增量）确保了助手是唯一普遍可被该检验访问的角色。

英文摘要

Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citep{jack2026twomodes} we showed they can also recognize when they are currently acting on-policy, through the sharp entropy drop of assistant-mode generation. Both signals are tied to the Assistant persona that post-training mainly shapes. This paper widens the frame to cross-persona authorship judgement on Llama-3.1-70B-Instruct. We measure a matrix of authorship claim rates over a panel of evaluator and generator personas spanning librarian to dragon to Shakespeare, and make two claims. *First*, on the Assistant's own row of the matrix, the Assistant's claim rate, the persona-vector distance from the Assistant in activation space, and the entropy gap between the Assistant's surprise on a persona's text and the persona's surprise on its own text are all tightly coupled. This extends the entropy signature of *acting* from the companion paper to a retrospective signature of *having acted*. *Second*, this coupling fails off the Assistant's row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators (pirate, dragon, Shakespeare); what does is asymmetric: the evaluator's surprise compared to the Assistant's surprise on the same text, not to the generator's. We rule out the alternative that any persona could play this reference role by trying many candidate substitutes; none does. We interpret the asymmetry as the model performing an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis, with the persona-vector geometry of \citet{chen2025persona} (every persona a delta off the Assistant) ensuring that the Assistant is the only persona universally accessible to that test.
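The asymmetric quantity described above, the evaluator's surprise compared with the Assistant's surprise on the same text, can be sketched from per-token log-probabilities. The sketch assumes "surprise" means average negative log-probability per token and uses made-up log-probability values; the paper's exact estimator is not given in the abstract.

```python
def mean_surprise(token_logprobs):
    """Average negative log-probability per token (in nats)."""
    return -sum(token_logprobs) / len(token_logprobs)


def surprise_gap(evaluator_logprobs, assistant_logprobs):
    """Evaluator surprise minus Assistant surprise on the same text.

    This equals (1/n) * log(p_assistant(text) / p_evaluator(text)), i.e. a
    per-token log-likelihood ratio in favor of the Assistant.
    Assumption: both lists score the same n tokens.
    """
    if len(evaluator_logprobs) != len(assistant_logprobs):
        raise ValueError("both personas must score the same tokens")
    return mean_surprise(evaluator_logprobs) - mean_surprise(assistant_logprobs)


# Hypothetical token log-probs for one two-token text under each persona.
evaluator = [-2.0, -2.0]
assistant = [-1.0, -1.0]
gap = surprise_gap(evaluator, assistant)
print(gap)
```

Written this way, a positive gap means the text is better explained by the Assistant than by the evaluator persona, which is the direction a likelihood-ratio test against the Assistant would read as evidence against the evaluator's own authorship.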

2606.00544 2026-06-02 cs.LG cs.CL

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

逃离模式抽彩：多响应训练提升语言模型泛化能力

Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna

发表机构 * Department of Computer Science, Purdue University（计算机科学系，普渡大学）

AI总结本文提出多响应训练（MRT）方法，通过保留每个提示的多个有效响应来缓解传统单响应微调导致的“模式抽彩”问题，并从统计角度揭示了其提升分布泛化的原理和适用条件。

Abstract

Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the "mode lottery," where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.
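
As a rough illustration of the Random-K-of-N selection named in the abstract, the sketch below keeps up to K responses per prompt, drawn uniformly at random without replacement from the N available. The function names `random_k_of_n` and `to_training_pairs`, the dictionary layout, and the fixed seed are assumptions made for this example; they are not the authors' implementation.

```python
import random

def random_k_of_n(responses_by_prompt, k, seed=0):
    """Keep up to k responses per prompt, sampled uniformly without replacement."""
    rng = random.Random(seed)
    selected = {}
    for prompt, responses in responses_by_prompt.items():
        # Every response is equally likely to be kept, so the choice
        # does not depend on any reward or quality score.
        take = min(k, len(responses))
        selected[prompt] = rng.sample(responses, take)
    return selected

def to_training_pairs(selected):
    """Flatten the selection into (prompt, response) fine-tuning pairs."""
    return [(p, r) for p, rs in selected.items() for r in rs]

# Hypothetical toy data: each prompt admits several valid responses.
data = {
    "Name a prime number.": ["2", "3", "5", "7"],
    "Greet the user.": ["Hi!", "Hello."],
}
subset = random_k_of_n(data, k=2)
pairs = to_training_pairs(subset)
print(len(pairs))
```

With K = 1 this reduces to the usual single-response setup; larger K spends more of the data budget on the conditional output distribution of each prompt.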

2606.00543 2026-06-02 cs.CV

ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

Yiling Gao, Hongchen Wei, Zhenzhong Chen

Affiliation: School of Remote Sensing and Information Engineering, Wuhan University

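AI summary: Proposes the ETC framework, which builds on the principle of variational information distillation to minimize task loss while reducing the number of input tokens. It weights visual features with text-to-image cross-attention and introduces variational information distillation, retaining strong task performance even under single-token compression.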

Abstract

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational cost and KV-cache overhead during inference. To address this problem, we propose Extreme Token Compression (ETC), a framework based on the principle of variational information distillation that minimizes task loss while reducing the number of input tokens. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC uses text-to-image cross-attention to weight the original visual features, approximating the latent instruction-aware predictive statistic. Moreover, ETC introduces variational information distillation, enabling the compact representation to preserve the information needed to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
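
The idea of weighting visual features by text-to-image attention can be illustrated with a generic attention-pooling sketch that collapses several visual tokens into one. This is only a toy under stated assumptions: a single hypothetical text query vector, scaled dot-product scores, and no learned projections. The abstract does not specify ETC's actual attention layers or its distillation objective, and none of that is reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(text_query, visual_tokens):
    """Pool visual tokens into one vector, weighted by similarity to the text query."""
    d = len(text_query)
    # Scaled dot-product score between the query and each visual token.
    scores = [
        sum(q * v for q, v in zip(text_query, tok)) / math.sqrt(d)
        for tok in visual_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the tokens, computed per feature dimension.
    pooled = [
        sum(w * tok[i] for w, tok in zip(weights, visual_tokens))
        for i in range(d)
    ]
    return pooled, weights

# Hypothetical 2-D features: the first token aligns best with the query.
text_query = [1.0, 0.0]
visual_tokens = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
pooled, weights = attention_pool(text_query, visual_tokens)
print(weights)
```

The token most aligned with the query receives the largest weight, so the single pooled vector is dominated by instruction-relevant features.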

2606.00537 2026-06-02 cs.RO

PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking

Junnan Nie, Jiayi Li, Jiachen Zhang, Junyi Lao, Chenghao Liu, Tianle Zhang, Songfang Huang

Affiliations: Peking University; JD Explore Academy

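AI summary: Proposes PACE, which identifies low-speed transition points in the predicted action chunk online and uses them as replanning boundaries to adaptively select the execution horizon, improving robot policy success rates without retraining.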

Comments: 21 pages, 7 figures, 6 tables. Preprint

Abstract

Recent vision-language-action and diffusion-based robot policies often use action chunking, where each policy query predicts a sequence of future actions and the robot executes an open-loop prefix before re-querying. While this interface improves local motion continuity, deployment still requires choosing the execution horizon: how much of each predicted chunk should be executed before acquiring a new observation. However, our experiments show that success is strongly task-dependent and non-monotonic with respect to the execution horizon, making a single constant horizon an unreliable deployment rule. We propose PACE (Phase-Aware Chunk Execution), a training-free test-time execution method that selects the execution horizon online from the predicted chunk itself. PACE exploits the phase-dependent kinematic structure of manipulation trajectories by identifying low-speed transition points in the predicted speed profile and using them as candidate replanning boundaries. Because PACE uses only the predicted action chunk, it is plug-and-play and requires no retraining or access to policy internals. We validate PACE through large-scale evaluations in both simulation and real-robot settings. On 50 RoboTwin2.0 tasks, PACE raises the average success rate from 57.8% to 64.2%. In real-robot experiments on bimanual ALOHA and single-arm Franka platforms, PACE improves the average task score from 60.7 to 77.7 and the average success rate from 50.7% to 70.4%. Ablations and rollout-level analyses show that PACE adapts execution horizons across manipulation phases, shortening near transitions while preserving longer execution during coherent motion.
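
To make the idea of choosing an execution horizon from the predicted chunk more concrete, here is a minimal sketch. It assumes each action is a position-like vector, so the distance between consecutive actions serves as a speed proxy, and it replans at the first low-speed local minimum. The function names, the `min_h` and `speed_thresh` parameters, and the "first candidate" rule are illustrative assumptions; the abstract only states that low-speed transition points serve as candidate replanning boundaries.

```python
import math

def speed_profile(chunk):
    """Speed between consecutive predicted actions (Euclidean distance per step)."""
    return [math.dist(a, b) for a, b in zip(chunk, chunk[1:])]

def choose_horizon(chunk, min_h=2, speed_thresh=0.05):
    """Return how many actions to execute before re-querying the policy.

    Picks the first low-speed local minimum of the speed profile and
    falls back to executing the whole chunk if none is found.
    """
    speeds = speed_profile(chunk)
    for i in range(1, len(speeds) - 1):
        is_local_min = speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]
        horizon = i + 1  # execute actions 0..i, then replan at the slow step
        if is_local_min and speeds[i] < speed_thresh and horizon >= min_h:
            return horizon
    return len(chunk)

# Hypothetical 1-D chunk: the motion nearly stops between the 4th and 5th action.
chunk = [(0.0,), (0.2,), (0.4,), (0.5,), (0.51,), (0.7,), (0.9,), (1.1,)]
# A chunk with steady motion and no slow transition.
steady = [(0.0,), (0.2,), (0.4,), (0.6,), (0.8,)]

print(choose_horizon(chunk), choose_horizon(steady))
```

Because the rule reads only the predicted actions, a wrapper of this kind can sit between any chunking policy and the controller without touching the policy itself.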

2606.00535 2026-06-02 cs.LG

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

Affiliations: New York University; Cerebras Systems Inc.; University of Pennsylvania

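AI summary: Proposes the DREAM-S framework, which uses neural architecture search with target-aware supernet training to automatically optimize the draft model architecture and the draft-target interaction strategy. Combined with adaptive intermediate feature distillation guided by attention entropy, it enables efficient speculative decoding for vision-language models, with speedups of up to 3.85×.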

Abstract

Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored. We propose~\textit{DREAM-S}, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM-S leverages a neural architecture search (NAS) framework with target-aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models and the most suitable draft model architecture for the underlying hardware implementation platform. DREAM-S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well-established VLMs show that DREAM-S achieves up to a $3.85\times$ speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM-S
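
For readers unfamiliar with speculative decoding, the sketch below shows the generic greedy draft-then-verify loop on toy integer "models". It does not reproduce DREAM-S's architecture search, supernet training, or distillation. The callables `target_next` and `draft_next`, the parameter `k`, and the toy next-token rules are all assumptions made for illustration.

```python
def speculative_decode(target_next, draft_next, prompt, num_new, k=4):
    """Greedy speculative decoding: a draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    verify_rounds = 0
    while len(tokens) - len(prompt) < num_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) The target checks the proposal. A real system scores all k
        #    positions in one batched forward pass; this toy loops instead.
        verify_rounds += 1
        ctx = list(tokens)
        for t in proposal:
            expected = target_next(ctx)
            if expected != t:
                ctx.append(expected)  # replace the first mismatch with the target's token
                break
            ctx.append(t)
        else:
            ctx.append(target_next(ctx))  # all accepted: the target adds one bonus token
        tokens = ctx
    return tokens[: len(prompt) + num_new], verify_rounds

# Toy target: count upward modulo 10. Toy draft: same, but wrong after a 2.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10

output, rounds = speculative_decode(target_next, draft_next, prompt=[0], num_new=8)
print(output, rounds)
```

The output matches what the target alone would generate greedily, but here it takes two verification rounds instead of eight sequential target steps, which is where the speedup comes from when the draft is usually right.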
