arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03083 2026-06-03 cs.AI

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li, Xu Chen

AI summary: Proposes the DeltaMem framework, which organizes experience memory into two independent residual trees (goal-conditioned task experience and scene-level environment knowledge), uses incremental delta nodes to reduce redundancy, and relies on a failure-penalized similarity scan and an autonomous consolidation mechanism for efficient retrieval and self-organization. It outperforms existing baselines across diverse interactive environments.

Abstract

Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal-conditioned task experience as reusable skills and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure-penalized similarity scan locates the best match, reconstructing the full experience via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling the trees to self-organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import-myself/DeltaMem.
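
The retrieval step described in this abstract (a failure-penalized scan followed by root-to-match chain composition) can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the `Node` class, the word-overlap (Jaccard) similarity, the linear `penalty` per recorded failure, and the choice to model an experience as a dictionary whose fields later deltas override.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str                          # text describing when this experience applies
    delta: dict                       # fields this node adds or overrides
    parent: Optional["Node"] = None   # None marks a root node
    failures: int = 0                 # times this node led to a failed episode


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity (a stand-in for an embedding similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve(nodes, query, penalty=0.1):
    """Failure-penalized scan: similarity minus a penalty per recorded failure."""
    return max(nodes, key=lambda n: jaccard(n.key, query) - penalty * n.failures)


def reconstruct(node):
    """Compose the full experience along the chain from the root to the match."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    experience = {}
    for n in reversed(chain):         # root first, so later deltas override it
        experience.update(n.delta)
    return experience


root = Node("heat an object", {"tool": "microwave", "steps": "find, take, heat"})
egg = Node("heat an egg", {"steps": "find, take, heat, cool"}, parent=root)
mug = Node("heat a mug", {"tool": "stove"}, parent=root, failures=3)

match = retrieve([root, egg, mug], "heat an egg")
full = reconstruct(match)
```

In this toy tree the egg node stores only the changed `steps` field and inherits `tool` from the root, which is the redundancy saving the abstract attributes to delta nodes.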

2606.03080 2026-06-03 cs.CL cs.AI

Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

Mingkuan Zhao, Xiayu Sun, Wentao Hu, Suquan Chen, Jiaxuan Li, Xiaoyan Zhu, Xin Lai, Jiayin Wang

AI summary: Proposes Regret Pre-training, a framework whose dual-view architecture uses future information to strengthen causal language models, raising average accuracy to 33.9% on the OLMoE-1B-7B architecture.

Abstract

Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.
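
The objective described in this abstract, a language-modeling loss plus a KL term between teacher and student distributions, can be written out for a single token position as a rough sketch. The direction KL(teacher || student), the weight `lam`, and the function names are assumptions for illustration. The abstract does not give the exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regret_objective(student_logits, teacher_logits, target, lam=1.0):
    """Cross-entropy on the target token plus lam * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)   # treated as a constant, no gradient
    lm_loss = -math.log(p_student[target])
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    return lm_loss + lam * kl


# When the teacher agrees with the student, the KL term vanishes.
same = regret_objective([0.0, 0.0], [0.0, 0.0], target=0)
# A teacher that has seen the future and is confident adds a positive penalty.
shifted = regret_objective([0.0, 0.0], [3.0, 0.0], target=0)
```

Because the teacher is evaluated without gradient, this matches the abstract's note that only one extra inference-mode forward pass is needed per step.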

2606.03078 2026-06-03 cs.CL

G^2C-MT: Graph-Guided Context Selection for Document-Level Machine Translation

Baijun Ji, Zixuan Zhou, Xiangyu Duan, Yu Liu, Longbo Sun, Rupu Wei, Bohong Zhao

AI summary: Proposes G^2C-MT, which builds a lightweight discourse graph and samples context paths with a depth-biased random walk, selecting context for document-level machine translation in a structured way and improving translation quality and robustness.

Comments
9 pages, 2 figures; IJCAI2026

Abstract

Effective document-level machine translation (DocMT) requires capturing long-range discourse dependencies. Recent work has explored retrieval-based and discourse-aware context selection. However, these approaches often lack an explicit mechanism for modeling structured discourse dependencies between distant paragraphs in a document. In this paper, we propose G^2C-MT (Graph-Guided Context for Machine Translation), which views DocMT context selection as a structured path discovery problem on a lightweight discourse graph, rather than retrieving unstructured context sets or relying on expensive LLM-based discourse modeling. In detail, we represent each paragraph as a node and model the relationship between each pair of nodes, considering their semantic similarity, adjacency, and keyword overlap. Furthermore, we propose a depth-biased random walk over the graph to sample a backward context path for each target paragraph. The context path will be used to prompt a large language model (LLM) for translation. This framework naturally supports multi-path context sampling, which can improve robustness by aggregating diverse translation candidates for discourse-ambiguous inputs. Experiments conducted across various domains show that G^2C-MT outperforms strong baselines on multiple LLMs, including DeepSeek-V3, Gemini-2.5-Flash-lite, and the Qwen-2.5/3 series.
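
To make the path-sampling idea concrete, here is a small sketch of a backward random walk over a paragraph graph. The abstract does not define how the edge score combines its three signals or what exactly "depth-biased" means, so the additive `edge_weight`, the `depth_bias` exponent (which here favors jumps to more distant earlier paragraphs), and all parameter names are assumptions.

```python
import random


def edge_weight(sim, i, j, overlap, w_adj=0.5, w_kw=0.5):
    """Combine semantic similarity, adjacency, and keyword overlap (assumed additive)."""
    adjacency = 1.0 if abs(i - j) == 1 else 0.0
    return sim + w_adj * adjacency + w_kw * overlap


def sample_context_path(target, weights, max_len=3, depth_bias=1.0, rng=None):
    """Sample a backward path of earlier paragraphs for paragraph `target`.

    weights[i][j] is the edge score between paragraphs i and j.
    """
    rng = rng if rng is not None else random.Random(0)
    path, current = [], target
    while current > 0 and len(path) < max_len:
        candidates = list(range(current))          # only earlier paragraphs
        scores = [weights[current][j] * (current - j) ** depth_bias
                  for j in candidates]
        if sum(scores) <= 0:
            break
        current = rng.choices(candidates, weights=scores, k=1)[0]
        path.append(current)
    return path[::-1]                              # return in document order


uniform = [[1.0] * 4 for _ in range(4)]
path = sample_context_path(3, uniform, rng=random.Random(1))
```

Calling `sample_context_path` several times with different seeds would give the multiple context paths that the abstract aggregates into diverse translation candidates.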

2606.03075 2026-06-03 cs.CV

TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

Jizhihui Liu, Ruizi Han, Miao Zhang, Rui Shao, Xuebo Liu, Weili Guan, Yaowei Wang

AI summary: To address the KV-cache memory consumption caused by redundant visual information in vision-language models, proposes TGV-KV, a text-grounded KV eviction method that combines text-vision budgeting, text-weighted ranking, and text-prioritised retention to raise inference throughput substantially while maintaining high accuracy.

Comments
Accepted by ICML-2026

Abstract

Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality. Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. Through a systematic analysis of the modality gap in VLMs, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose a Text-Grounded KV Eviction method for VLMs (TGV-KV). TGV-KV comprises three submodules: (1) Text-Vision Budgeting (TVB) assigns a budget to each layer based on the mutual information interaction. (2) Text-Weighted Ranking (TWR) assesses the priority of text and ranks vision importance based on weighted text-image attention. (3) The Text-Prioritised Retention (TPR) policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% of full-KV accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts end-to-end throughput by 52.6% with an extreme retention budget of 5%. Code is available at https://github.com/Danielement321/TGV-KV.

2606.03074 2026-06-03 cs.LG cs.SY eess.SY

RMPrior: Bridging Propagation Priors and Diffusion Refinement for Efficient Radio Map Construction

Zixuan Guo, Xiucheng Wang, Nan Cheng

AI summary: Proposes a mid-start sampling strategy that perturbs a propagation prior to an intermediate diffusion timestep and runs only the remaining reverse steps, achieving a 2.01x speedup while improving reconstruction quality. A theoretical analysis gives an upper bound on the initialization gap and a condition under which truncation helps.

Abstract

Diffusion models achieve high-fidelity radio map construction through iterative denoising, yet their sampling cost limits practicality in dynamic wireless systems where radio maps must be refreshed repeatedly. Meanwhile, classical propagation models encode valuable scene-level knowledge that standard diffusion inference discards entirely by initializing from pure Gaussian noise. This paper bridges propagation priors and diffusion refinement through a mid-start sampling strategy. A matched propagation prior is perturbed to an intermediate diffusion timestep, and the pretrained diffusion backbone executes only the remaining reverse steps, focusing computation on multipath-aware refinement rather than full reconstruction from noise. We provide theoretical analysis establishing an upper bound on the initialization gap, a sufficient condition under which truncation improves reconstruction fidelity, and a formal characterization of prior-quality sensitivity under aggressive truncation. Experiments on IRT4HighRes show that, at $P_{\text{start}}=0.5$, the proposed method achieves a $2.01\times$ speedup while simultaneously improving NMSE, RMSE, SSIM, and PSNR over the full-step baseline. A prior-quality ablation across three propagation models of different fidelity confirms that reconstruction quality tracks prior quality, with the sensitivity amplified under shorter reverse trajectories, consistent with the theoretical predictions. These results also suggest that mid-start reconstruction quality can serve as a proxy for ranking the scene-level fidelity of different propagation models.
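
Mid-start sampling as described in this abstract can be sketched with the standard DDPM forward perturbation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, applied to the propagation prior instead of to pure noise. The linear beta schedule, the reading of `p_start` as the fraction of the trajectory that is kept, and the `denoise_step` callback (a placeholder for the pretrained backbone) are all assumptions.

```python
import math
import random


def alpha_bar(t, T, beta_min=1e-4, beta_max=0.02):
    """Cumulative product of (1 - beta_s) for a linear beta schedule (assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_min + (beta_max - beta_min) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod


def mid_start(prior, p_start, T, denoise_step, rng):
    """Perturb the prior to timestep t_start, then run only the remaining steps."""
    t_start = max(1, round(p_start * T))
    ab = alpha_bar(t_start, T)
    x = [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
         for v in prior]
    steps = 0
    for t in range(t_start, 0, -1):       # reverse steps t_start, ..., 1
        x = denoise_step(x, t)
        steps += 1
    return x, steps


# Identity denoiser, used here only to count how many reverse steps run.
_, steps_used = mid_start([0.1, 0.2], 0.5, 100, lambda x, t: x, random.Random(0))
```

With `p_start = 0.5` and `T = 100`, only 50 of the 100 reverse steps run, which is in line with the roughly twofold speedup the abstract reports.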

2606.03073 2026-06-03 cs.LG cs.AI

Efficient Hyperparameter Optimization for LLM Reinforcement Learning

Minping Chen, Bowen Xiao, Du Liang, Chuxuan Zeng, Zeyi Wen

AI summary: Proposes Joint Fidelity Hyperparameter Optimization, which adapts model size and training budget together as fidelity and integrates early-stopping strategies and a checkpointing mechanism, substantially improving computational efficiency (up to 14.9x per trial) with performance improvements of 5.8% to 111.6%.

Comments
12 pages, 6 figures, accepted at ACL 2026

Abstract

Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO methods remain inefficient for LLM RL due to the massive model scale and resource-intensive training cycles. In this paper, we propose Joint Fidelity Hyperparameter Optimization (JF-HPO), which simultaneously adapts both model size and training budget as fidelity. JF-HPO rests on three components: (i) a small proxy model of the target LLM for efficient training and evaluation in each HPO trial; (ii) carefully designed early-stopping strategies based on training dynamics; and (iii) an efficient checkpointing mechanism that eliminates redundant computations. Compared with existing HPO methods, JF-HPO significantly improves the computational efficiency of each trial (up to 14.9 times), while achieving better or competitive predictive accuracy under the same time budget. Notably, compared with utilizing hyperparameter configurations from the VeRL Recipe, JF-HPO delivers performance improvements ranging from 5.8% to 111.6%.

2606.03069 2026-06-03 cs.CV cs.AI cs.LG

ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements

Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad

AI summary: Identifies four limitations of the WT-PSE framework and proposes four improvements (domain-adaptive augmentation, a hybrid loss function, curriculum-based weight scheduling, and ablation control flags), reaching a Dice score of 0.956 on fundus optic disc segmentation.

Comments
8 pages, 6 figures; code available at https://github.com/213269/WT-PSE-code-main

Abstract

Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance-based knowledge distillation to achieve robust cross-domain segmentation. This study systematically examines improvements to the WT-PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per-pixel binary cross-entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain-adaptive augmentation including random erasing, gamma correction, and salt-and-pepper noise; (2) a hybrid BCE and Dice loss function for improved edge-aware segmentation under noisy conditions; (3) a curriculum-based Dice weight scheduling strategy; and (4) command-line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic-disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch-5 Dice score of 0.939. These results indicate that training-level improvements can provide consistent performance gains without modifying the underlying WT-PSE architecture.
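
Enhancements (2) and (3) in this abstract, a hybrid BCE and Dice loss with a curriculum-based Dice weight, can be sketched over flat lists of per-pixel probabilities. The linear ramp, `warmup_epochs`, `max_weight`, and the convex combination of the two losses are assumptions. The abstract does not state the schedule shape or the weighting formula.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 minus the smoothed Dice coefficient."""
    inter = sum(p * y for p, y in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


def dice_weight(epoch, warmup_epochs=5, max_weight=0.5):
    """Curriculum weight: linear ramp from 0 to max_weight (assumed shape)."""
    return max_weight * min(1.0, epoch / warmup_epochs)


def hybrid_loss(pred, target, epoch):
    """BCE dominates early training, and Dice is phased in as epochs advance."""
    w = dice_weight(epoch)
    return (1.0 - w) * bce_loss(pred, target) + w * dice_loss(pred, target)


pred, mask = [0.9, 0.8, 0.1], [1, 1, 0]
early, late = hybrid_loss(pred, mask, epoch=0), hybrid_loss(pred, mask, epoch=10)
```

Starting with a zero Dice weight is one way to address the early-training instability that the abstract attributes to unscheduled loss weighting.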

2606.03068 2026-06-03 cs.LG cs.AI

Learn When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs

Jaejun Lee, Joyce Jiyoung Whang

AI summary: Proposes the MAVN framework, which decides adaptively and in an end-to-end differentiable way at which layer of a message passing neural network, and for which nodes, to introduce virtual nodes, and establishes connections through a dual-perspective scoring mechanism. The authors prove that it can simulate any node-VN connectivity pattern, and experiments show that it markedly improves backbone performance on multiple datasets.

Comments
12 pages, 6 figures, 10 tables, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract

While Virtual Nodes (VNs) are often utilized in Message Passing Neural Networks (MPNNs) to facilitate effective message passing, existing VN-based methods have limitations, such as constraining all nodes to connect to the same number of VNs, fixing the connections before applying MPNNs, and connecting a node to a VN independently of the other nodes that connect to the same VN. We propose MAVN, an end-to-end differentiable MPNN framework that allows non-constrained connections between nodes and VNs and dynamically introduces VNs on demand in response to evolving node representations across layers. Specifically, MAVN learns to adaptively determine when (at which layer) and where (to which nodes) to introduce and connect VNs based on the relative importance of connections. From a pool of candidate VNs, MAVN selects the necessary VNs in each layer, where each selected VN is connected to a nonempty subset of nodes, guided by a dual-perspective scoring mechanism that jointly captures the nodes' preferences for VNs and the VNs' preferences for nodes. We theoretically prove that for any node-VN connectivity pattern, there exists a set of MAVN's parameters that can simulate the pattern. Experiments on nine real-world datasets demonstrate that MAVN consistently improves the performance of backbone MPNNs, achieving up to 46.5% improvement over the backbones and outperforming the baselines.

2606.03066 2026-06-03 cs.AI

CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

Jinjie Shen, Yaxiong Wang, Yujiao Wu, Lechao Cheng, Tianrui Hui, Nan Pu, Zhihui Li, Zhun Zhong

AI summary: Proposes the CORE framework, which builds a Conflict Attribution Corpus and applies conflict-oriented reasoning to strengthen the conflict-capturing ability of multimodal large language models, enabling robust and generalizable multimodal manipulation detection.

Comments
Accepted by ICML 2026

Abstract

The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation-specific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observe that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose the Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endow multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.

2606.03063 2026-06-03 cs.LO cs.CL

ZX-Calculus: Trace-Indexed Dependent Types and Epistemic Semantics

Peng Chen

AI summary: Proposes ZX-Calculus, a conservative extension of Martin-Löf dependent type theory built on trace-indexed types, presheaf non-monotone semantics, and constructive AGM belief revision, accompanied by a Coq mechanisation.

Abstract

We propose ZX-Calculus (Knowledge Evolution Calculus), a conservative extension of Martin-Löf Dependent Type Theory (MLTT) integrating trace-indexed types, presheaf non-monotone semantics, and constructive AGM belief revision. A Coq mechanisation accompanies the paper (34 complete proofs; zero admits for the two central results). (I) Trace types. FinTrace(s0,sn) is an inductive family of typed execution traces. FinTrace and Star(Step) are isomorphic as path types but not judgementally equal; TraceElim exposes the event label e:Event explicitly, giving a more ergonomic interface for event-driven induction. We prove the Trace-Reachability Correspondence, Deterministic Replay, and a canonicity framework via reducibility candidates with a Transport Lemma (RC-elim deferred; all other Core results are Coq-verified). (II) Sheaf semantics. Trace-indexed propositions are contravariant sheaves over the free trace partial-order category Tf. A Separation Theorem (explicit countermodel) distinguishes proof-theoretic monotonicity from semantic non-monotonicity. The term model is an initial CwF (syntactic universal property, not classical completeness). (III) AGM belief revision. We give an explicit constructive partial meet contraction algorithm verified against (C1)-(C4). All eight AGM postulates (R1)-(R8) are theorems. Proofs of R7 and R8 use the Disjunctive Entrenchment Lemma, given a self-contained constructive derivation. (IV) Integration. B^AGM fails the sheaf composition law BP-comp for sequential revision (explicit countermodel, Coq-verified). We introduce Single-Step Revision Systems (SSRS), prove B^AGM is a valid SSRS (Coq-verified), and show this suffices for trace morphisms, retraction characterisation, and revision witnesses. The BP-comp failure reveals a fundamental tension between path-dependent belief revision and functor consistency, not previously identified.

2606.03061 2026-06-03 cs.DC cs.AI cs.LG cs.NI cs.SY eess.SY

Brief Announcement: Generative Markov Model for Distributed Computing Systems

Alfreds Lapkovskis, Ali Beikmohammadi, Sindri Magnússon, Praveen Kumar Donta

AI summary: To handle the heterogeneity and complexity of distributed computing systems, proposes a generative Markov model based on a structured state factorization that makes simulation, inference, and policy learning tractable, and validates it with a collaborative AI inference case study.

Comments
Submitted to 40th International Symposium on Distributed Computing (DISC 2026)

Abstract

Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.

2606.03057 2026-06-03 cs.LG cs.AI

Rethinking Molecular Text Representations for LLMs: An Empirical Study

Arun Raja, Garrett M. Morris, Kian Ming A. Chai

AI summary: A systematic benchmark evaluates 16 LLMs across 9 molecular representations and 8 chemical tasks and finds that the choice of representation strongly affects results: structured text representations (CML, MolJSON) lead on structural tasks, IUPAC leads on semantic tasks, and SMILES is rarely optimal.

Comments
25 pages, 11 figures, 20 tables

Abstract

Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations and eight chemical tasks. We benchmark 16 LLMs across five model families, including reasoning and non-reasoning variants, chemistry-specialized LLMs, and closed frontier models. Performance is strongly representation-dependent and no single representation wins across tasks, though CML is the best, followed by MolJSON, InChI, and then canonical SMILES. Explicit structured text representations (CML and MolJSON) dominate structural tasks; IUPAC dominates semantic tasks, winning molecule retrieval for all 16 LLMs; and SMILES variants are rarely optimal despite their prevalence in pretraining. Chemistry-specialized models perform well with SMILES at the cost of large degradations with structured text representations, suggesting SMILES-only evaluation rewards specialization that does not generalize. Using LLM-as-a-judge, we find that IUPAC produces the highest fraction of correct molecule generations. A mechanistic study via tokenization audits, linear probes and attention shows that representations are encoded differently inside the model; for example, structured representations require higher attention across the molecular span. Our results argue against representation-invariant evaluation and motivate task-aware representation routing for LLM-based chemistry.

2606.03056 2026-06-03 cs.AI

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang

AI summary: Proposes SkillDAG, which models inter-skill relationships as a typed directed graph exposed as an inference-time, agent-callable structural retrieval interface that supports online evolution, and substantially outperforms baselines on ALFWorld and SkillsBench.

Comments
19 pages, 5 figures

Abstract

As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.
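
The interface described in this abstract (a search that returns matches, typed-edge neighbors, and conflict signals, plus a propose-then-commit protocol for new edges) can be pictured with a toy data structure. The class name, method names, edge-type labels, and the rule that a commit keeps edges only when execution succeeded are hypothetical and are not taken from the paper. The vector search itself is replaced by a list of matches passed in by the caller.

```python
from collections import defaultdict

# Edge types loosely named after the relations listed in the abstract.
EDGE_TYPES = {"depends_on", "conflicts_with", "specializes", "duplicates"}


class SkillGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # skill -> {(edge_type, other_skill)}
        self.pending = []               # proposed edges awaiting commit

    def propose(self, src, edge_type, dst):
        """Stage an edge that the agent believes holds between two skills."""
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.pending.append((src, edge_type, dst))

    def commit(self, execution_succeeded):
        """Keep staged edges only when execution backs them up."""
        if execution_succeeded:
            for src, edge_type, dst in self.pending:
                self.edges[src].add((edge_type, dst))
        self.pending.clear()

    def search(self, matches):
        """Return the matches, their typed neighbors, and conflicts among them."""
        neighbors = {s: sorted(self.edges[s]) for s in matches}
        conflicts = sorted((s, d) for s in matches
                           for t, d in self.edges[s]
                           if t == "conflicts_with" and d in matches)
        return {"matches": list(matches),
                "neighbors": neighbors,
                "conflicts": conflicts}


graph = SkillGraph()
graph.propose("pip_install", "conflicts_with", "conda_install")
graph.commit(execution_succeeded=True)
graph.propose("plot_chart", "depends_on", "load_csv")
graph.commit(execution_succeeded=False)     # discarded: not backed by execution
result = graph.search(["pip_install", "conda_install"])
```

Edges are only ever added in this sketch, which loosely mirrors the set-monotone online edits the abstract credits for growing recall without evicting prior hits.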

2606.03054 2026-06-03 cs.AI

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang

AI summary: To address costly and unnecessary tool calls in tool-augmented vision-language agents, proposes ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and structural features, lowering token cost while maintaining or improving accuracy.

Abstract

Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.

2606.03052 2026-06-03 cs.LG

What Do Students Learn? A Feature-Level Analysis of Dark Knowledge

Seungu Kang, Songkuk Kim

AI summary: Uses the Interaction Tensor framework to analyze feature learning in student models under knowledge distillation, finds that effective distillation acts as a regularizer that removes low-frequency, sample-specific features, and proposes Confusion Distillation (CD), a teacher-free self-distillation method based on the confusion matrix that outperforms existing self-distillation methods on CIFAR-100.

Comments
Accepted at ICPR 2026

Abstract

Knowledge Distillation (KD) is a powerful tool for model compression, yet the precise mechanisms by which student models acquire feature representations remain underexplored. In this work, we analyze student feature learning using the Interaction Tensor framework. Our analysis reveals that effective KD acts as a regularizer that prunes low-frequency, sample-specific features, encouraging the student to rely on a compact set of highly reusable features. Crucially, we observe that the dataset-level confusion matrix contains structural information analogous to the teacher's "Dark Knowledge." Leveraging this insight, we propose Confusion Distillation (CD), a teacher-free self-distillation method that utilizes the model's own evolving confusion patterns as dynamic soft targets. CD achieves competitive performance on ResNet-34 and ResNet-50 for CIFAR-100, outperforming existing self-distillation methods like CS-KD and PS-KD by 1.2% while offering a computationally efficient alternative to standard KD.
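
The idea of using a model's own confusion patterns as soft targets can be illustrated with a minimal Python sketch. The abstract does not give the CD loss, so the blending rule, the `alpha` weight, and the function names below are assumptions made for illustration only, not the authors' formulation.

```python
def confusion_soft_targets(confusion, label, alpha=0.5, eps=1e-8):
    """Blend a one-hot label with the normalized confusion row for that label.

    confusion[i][j] counts how often true class i was predicted as class j.
    alpha is a hypothetical weight on the confusion-derived distribution.
    """
    row = confusion[label]
    total = sum(row) + eps * len(row)
    soft = [(count + eps) / total for count in row]
    return [
        (1.0 - alpha) * (1.0 if j == label else 0.0) + alpha * p
        for j, p in enumerate(soft)
    ]


def update_confusion(confusion, label, predicted):
    """Accumulate one prediction so the soft targets evolve during training."""
    confusion[label][predicted] += 1


# Toy 3-class confusion matrix: class 0 is sometimes mistaken for class 1.
confusion = [[8, 2, 0], [1, 9, 0], [0, 0, 10]]
target = confusion_soft_targets(confusion, label=0)
```

In this toy case the target for class 0 keeps most of its mass on the true class while giving class 1 more weight than class 2, which is the kind of inter-class structure the abstract compares to a teacher's dark knowledge.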

2606.03050 2026-06-03 cs.CV

FCUS-rPPG: A Fast-Converging Unsupervised Framework for Remote Photoplethysmography via Gradient Oscillation Suppression

FCUS-rPPG:一种通过梯度振荡抑制实现快速收敛的无监督远程光电容积描记框架

Jiajie Li, Yu Liu, Rencheng Song, Xun Chen, Juan Cheng

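AI summary: Proposes FCUS-rPPG, which combines a spectrally shared backbone with unified optimization at the gradient, loss-landscape, and feature-representation levels, converging in a single training epoch and reaching state-of-the-art performance in cross-dataset evaluation.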

AI中文摘要

远程光电容积描记术(rPPG)利用消费级摄像头实现非接触式血容量脉搏(BVP)信号提取。近期的无监督rPPG方法无需真实生理标注即可学习BVP表示,但其优化常受噪声和不稳定梯度影响,导致收敛缓慢且跨域泛化能力有限。本文提出FCUS-rPPG,一种快速收敛且具有强泛化能力的无监督rPPG框架。受BVP表示同时具有多光谱共变和低维流形结构的观察启发,我们设计了光谱共享骨干网络,促进BVP特征解耦并提高优化效率。为了联合增强收敛稳定性和泛化性能,我们进一步开发了一个在梯度、损失景观和特征表示层面运作的统一优化框架。具体而言,后验证掩蔽机制根据BVP信号的弱幅度生理先验过滤误导性梯度;基于扰动的损失景观平滑策略将优化导向更可泛化的平坦最小值;噪声感知零空间正则化将特征更新约束在噪声子空间的正交补空间内,从而减轻噪声引起的表示漂移。在五个数据集上的大量实验表明,FCUS-rPPG仅需一个训练周期,而现有方法通常需要数十到数百个周期。值得注意的是,FCUS-rPPG在跨数据集评估中持续达到最先进(SOTA)性能。本研究为无监督rPPG的实际部署提供了高效且鲁棒的解决方案。源代码将在 https://github.com/JiaJieLee/FCUS-rPPG 公开。

英文摘要

Remote photoplethysmography (rPPG) enables non-contact extraction of blood volume pulse (BVP) signals using consumer-grade cameras. Recent unsupervised rPPG methods learn BVP representations without requiring ground-truth physiological annotations, yet their optimization is often hindered by noisy and unstable gradients, resulting in slow convergence and limited cross-domain generalization. In this paper, we propose FCUS-rPPG, a fast-converging unsupervised rPPG framework with strong generalization capability. Motivated by the observation that BVP representations exhibit both multi-spectral covariation and low-dimensional manifold structure, we design a spectrally shared backbone that facilitates BVP feature disentanglement while improving optimization efficiency. To jointly enhance convergence stability and generalization performance, we further develop a unified optimization framework operating at the gradient, loss-landscape, and feature-representation levels. Specifically, a post-verification masking mechanism filters out misleading gradients according to the weak-amplitude physiological prior of BVP signals; a perturbation-based loss landscape smoothing strategy steers optimization toward more generalizable flat minima; and a noise-aware null-space regularization constrains feature updates to the orthogonal complement of the noise subspace, thereby mitigating noise-induced representation drift. Extensive experiments on five datasets demonstrate that FCUS-rPPG requires only one training epoch, whereas existing methods typically require tens to hundreds of epochs. Notably, FCUS-rPPG consistently achieves state-of-the-art (SOTA) performance in cross-dataset evaluations. This study provides an efficient and robust solution to the real-world deployment of unsupervised rPPG. The source code will be publicly available at https://github.com/JiaJieLee/FCUS-rPPG.

2606.03047 2026-06-03 cs.RO cs.MA

ModuLoop: Low-Level Code Generation using Modular Synthesizer and Closed-Loop Debugger for Robotic Control

ModuLoop: 使用模块化合成器和闭环调试器进行机器人控制的低级代码生成

Gina Yoon, Sumin Lee, Joo Yong Sim

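AI summary: Proposes a closed-loop modular code synthesis framework that uses a pre-trained LLM for modular code planning and generation, then debugs and refines the code through iterative execution with debugging probes; applied successfully to RGB-D camera and robotic arm calibration and to a pick-and-place task.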

Comments
IEEE Robotics and Automation Letters (2025)
AI中文摘要

大型语言模型(LLMs)在包括代码生成和问题解决在内的各个领域展示了令人印象深刻的表现。然而,它们在机器人控制中的应用,特别是在需要精确操作、实时反馈和环境依赖执行的低级任务中,仍然有限。为了解决这一挑战,我们提出了闭环模块化代码合成框架。该框架利用预训练的LLM,无需任何任务特定的微调,执行模块化代码规划和生成,并在迭代执行生成的代码的同时插入调试探针以观察其行为。这种闭环结构促进了系统性的调试和优化,最终生成可执行的控制程序。我们将该框架应用于RGB-D相机和机械臂的标定,验证了其在真实世界环境中的有效性。此外,通过后续的抓取任务,我们不仅展示了标定的准确性,还展示了框架的潜在可扩展性。在两个任务中,该框架都实现了高执行准确性和自主性,说明了使用我们框架进行基于LLM的机器人控制的实用性和可扩展性。

英文摘要

Large Language Models (LLMs) have demonstrated impressive performance across various domains, including code generation and problem solving. However, their application in robotic control, particularly in low-level tasks that require precise manipulation, real-time feedback, and environment-dependent execution, remains limited. To address this challenge, we propose the Closed-Loop Modular Code Synthesizer framework. This framework leverages a pre-trained LLM without any task-specific fine-tuning to perform modular code planning and generation, and iteratively executes the generated code while inserting debugging probes to observe its behavior. This closed-loop structure facilitates systematic debugging and refinement, ultimately producing executable control programs. We apply the proposed framework to the calibration of an RGB-D camera and a robotic arm, validating its effectiveness in real-world settings. Furthermore, through a subsequent pick-and-place task, we demonstrate not only the accuracy of the calibration but also the potential extensibility of the framework. Across both tasks, the framework achieved high execution accuracy and autonomy, illustrating the practicality and scalability of LLM-based robotic control using our framework.

2606.03043 2026-06-03 cs.CL

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

LLM作为评判者的几何学:为什么LLM间共识不等于人类对齐

Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram

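AI summary: Using geometric measurements, finds that LLM judges agree strongly with one another but align poorly with humans: their score subspace is nearly orthogonal to the human subspace, so consensus stems from subspace collapse rather than human alignment.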

AI中文摘要

LLM作为评判者现已成为标准做法,但评判者之间高度一致,与人类却只有弱一致性。我们在四个社区构建的印度语系(Indic)数据集、八种印度语言和41个LLM评判者上,针对标准的LLM-as-judge流程测量四个几何量(分数离散度、有效秩、与人类子空间的主夹角、评判者与人类之间的堆叠相关性,均带有自助法置信区间),以检验这反映的是共享信号还是共享偏差。在主观评分标准上,评判者使用的分数范围不到人类的一半($\sigma_J / \sigma_H \approx 0.3$--$0.5$)。他们的评估轴几乎与人类正交,且明显比人类彼此之间更远离人类($87^\circ$--$89^\circ$ 对比 $78^\circ$--$81^\circ$)。LLM间一致性超过LLM-人类一致性($r_{LL} \approx 0.35$ 对比 $r_{LH} \approx 0.27$--$0.32$)。在具有可验证事实答案的评分标准上,相同的诊断指标回落到人类范围内(轴 $58.5^\circ$;$r_{LH} = 0.519$)。微调和偏好优化恢复了分数离散度($0.32 \rightarrow 1.08$),但几乎不改变轴(仍为 $87^\circ$--$88^\circ$)。只有在小的人类锚定集上的后验校准才能同时改善所有四个社区健康评分标准,使校准后的24B Indic评判者($r = 0.184$)优于GPT-5.5($r = 0.123$),但仍未达到人类可靠性(在可验证评分标准上人类-人类 $r = 0.474$)。我们认为,只有当对评判者评分子空间的直接几何检查通过时,LLM间一致性才应被视为人类对齐的证据;否则,共识反映的是坍缩子空间内的一致性。

英文摘要

LLMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM-as-judge stack across four community-built Indic datasets, eight Indic languages, and 41 LLM judges: score spread, effective rank, principal angle to the human subspace, and stacked correlations among judges and humans, all with bootstrap confidence intervals. On subjective rubrics, judges use less than half the human score range ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Their evaluation axis is nearly orthogonal to the human one and noticeably further from humans than humans are from each other ($87^\circ$--$89^\circ$ versus $78^\circ$--$81^\circ$). Inter-LLM agreement exceeds LLM--human agreement ($r_{LL} \approx 0.35$ versus $r_{LH} \approx 0.27$--$0.32$). On a rubric with a verifiable factual answer, the same diagnostics fall back into the human range (axis $58.5^\circ$; $r_{LH} = 0.519$). Fine-tuning and preference optimization recover spread ($0.32 \rightarrow 1.08$) but barely move the axis (still $87^\circ$--$88^\circ$). Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, placing a calibrated 24B Indic judge ($r = 0.184$) ahead of GPT-5.5 ($r = 0.123$), yet still short of human reliability (human-human $r = 0.474$ on the verifiable rubric). We argue that inter-LLM agreement should be considered evidence of human alignment only when a direct geometric check on the judge's score subspace passes; otherwise, the consensus reflects agreement within a collapsed subspace.
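
Two of the diagnostics named above, score spread and the angle between evaluation axes, can be sketched in a few lines of Python. This is a simplified illustration that treats each evaluation axis as a single vector, in which case the principal angle reduces to the ordinary angle between two lines. The paper's subspace construction and bootstrap procedure are not described in the abstract, and the function names and toy scores here are hypothetical.

```python
import math
import statistics


def spread_ratio(judge_scores, human_scores):
    """Ratio of judge to human standard deviation (sigma_J / sigma_H)."""
    return statistics.pstdev(judge_scores) / statistics.pstdev(human_scores)


def axis_angle_degrees(u, v):
    """Angle in degrees between two axes, ignoring their sign."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))
    return math.degrees(math.acos(cosine))


# Toy scores: the judge compresses its ratings into a narrow band.
ratio = spread_ratio([3, 3, 4, 3, 4], [1, 5, 2, 4, 3])
# Toy axes: nearly orthogonal directions give an angle close to 90 degrees.
angle = axis_angle_degrees([1.0, 0.0], [0.1, 1.0])
```

A ratio well below 1 together with an angle near 90 degrees is the pattern the abstract reports for subjective rubrics.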

2606.03040 2026-06-03 cs.AI cs.LG

RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

RelGT-AC:用于关系数据库中自动完成任务的关系图Transformer

Phillip Jiang

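AI summary: Proposes RelGT-AC, which uses a column masking strategy, a unified task head, and a TF-IDF text encoder to outperform the GraphSAGE baseline on regression autocomplete tasks in relational databases.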

Comments
12 pages, 6 figures. Code and model checkpoints available at https://github.com/jiangdmv/graph-transformer
AI中文摘要

关系数据库支撑着现代企业、科学和医疗系统,但由于其多表、异构和时间结构,对此类数据进行预测性机器学习仍然具有挑战性。关系深度学习(RDL)通过将数据库表示为异构图并直接应用图神经网络(GNN)来解决这一问题。RelBench v2最近引入了自动完成任务——一种实际动机的任务类型,其目标是从关系上下文中预测现有列值,类似于智能表单填充助手。我们提出了RelGT-AC(用于自动完成的关系图Transformer),通过三个有针对性的贡献扩展了RelGT架构:(1)一种列掩码策略,通过在子图编码期间屏蔽目标列来防止平凡解;(2)一个统一的任务头,支持在单个模型内进行二分类、多分类和回归自动完成任务;(3)一个TF-IDF文本编码器,自动检测和编码自由文本列,恢复分类编码器丢弃的强词汇信号。在跨越3个RelBench v2数据集(rel-trial、rel-f1、rel-stack)的7个任务中,RelGT-AC在所有3个回归自动完成任务上优于GraphSAGE基线,并通过TF-IDF编码器在文本密集的资格任务上实现了高达+10 AUROC点的提升。

英文摘要

Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi-table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks -- a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form-filling assistant. We propose RelGT-AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF-IDF text encoder that automatically detects and encodes free-text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel-trial, rel-f1, rel-stack), RelGT-AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text-heavy eligibility tasks via the TF-IDF encoder.

2606.03038 2026-06-03 cs.LG physics.comp-ph physics.optics

Will Accurate Fields Mislead Photonic Design? From Global Accuracy to Port Readout

精确的场会误导光子设计吗?从全局精度到端口读出

Yitian Zhang, Yonghong Chen, Youming Chen, Yiyang Li, Xing Zhe, Renhe Lu, Shaolin Liao, Yuzhe Ma, Zhong Guan

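AI summary: Addresses the problem that surrogates with high global field accuracy can still give unreliable port readouts in photonic design; proposes PaNO, a propagation-aligned neural operator, and its output-aware variant PaNO-R2, which cuts port-power error by 72.7% on an MMI splitter benchmark.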

AI中文摘要

神经场代理可以加速光子设计循环,但一个在全局场误差上看起来精确的代理,当最终决策依赖于局部输出端口读出时,仍可能对候选器件进行错误排序。这种风险在传播主导的MMI分束器和耦合器中尤为严重,其中端口功率、分束、相位和耦合由累积的模态干涉和输出窗口聚合决定,而不仅仅是平均场相似性。我们通过场/中介/读出视角研究这种场到设计的不匹配,将密集复场误差与传播轮廓和输出窗口误差在端口聚合前分离。为了将代理与此链对齐,我们提出PaNO,一种传播对齐的神经算子,它保持全场预测接口,同时围绕局部边界结构、横向模态内容、轴向传播和交叉模态交互组织潜在状态。我们还评估了PaNO-R2,一种针对端口区域附近残余场分量的输出感知反馈变体。在具有4608个保留场的15波长可调谐$3\times3$ MMI基准上,PaNO将NeurOLight的端口功率误差从0.2018降低到0.0739,尽管cMAE略有升高,表明仅全局场精度不足以实现设计相关的读出保真度。PaNO-R2获得了最佳的cMAE、传播轮廓误差、输出轮廓误差和端口功率误差,将NeurOLight的端口功率和输出轮廓误差分别降低了72.7%和72.5%。

英文摘要

Neural field surrogates can accelerate photonic design loops, but a surrogate that looks accurate in global field error can still mis-rank candidate devices when the final decision depends on localized output-port readouts. This risk is acute in propagation-dominated MMI splitters and couplers, where port power, splitting, phase, and coupling are determined by accumulated modal interference and output-window aggregation rather than by average field similarity alone. We study this field-to-design mismatch through a Field/Mediator/Readout view that separates dense complex-field error from propagation-profile and output-window errors before port aggregation. To align the surrogate with this chain, we propose PaNO, a propagation-aligned neural operator that keeps the full-field prediction interface while organizing latent states around local boundary structure, transverse modal content, axial propagation, and cross-mode interaction. We also evaluate PaNO-R2, an output-aware feedback variant for residual field components near the port region. On a 15-wavelength tunable $3\times3$ MMI benchmark with 4608 held-out fields, PaNO lowers NeurOLight's port-power error from 0.2018 to 0.0739 despite slightly higher cMAE, showing that global field accuracy alone is not sufficient for design-relevant readout fidelity. PaNO-R2 attains the best cMAE, propagation-profile error, output-profile error, and port-power error, reducing NeurOLight's port-power and output-profile errors by 72.7% and 72.5%, respectively.

2606.03036 2026-06-03 cs.AI

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

TriEval: 一种资源高效的LLM偏见、毒性和真实性评估流水线

Akshatha Srikantha, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal

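AI summary: Proposes TriEval, a pipeline that evaluates bias, toxicity, and truthfulness of LLM outputs together, runs efficiently on a standard laptop, and reveals differences between open- and closed-source models in toxicity and truthfulness.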

AI中文摘要

LLM已经从基本的聊天机器人演变为AI生态系统的支柱,现在广泛应用于医疗、学校和政府服务。LLM的领域范围采用需要持续评估以确保其安全性和公平性。部署LLM后遇到的常见问题包括不一致的输出和错误信息的幻觉。尽管存在许多LLM评估工具,但大多数仅限于一次测试单个参数,或者需要大多数研究人员无法访问的大量计算资源。TriEval通过同时评估LLM输出的多个参数(包括偏见、毒性和真实性)来解决这些挑战,同时最小化计算资源。该流水线与开源和闭源模型兼容,并在没有GPU集群的标准笔记本电脑上运行。TriEval已在四个模型上测试:Llama 3 8B、Mistral 7B、Gemma 2 9B和Claude Haiku。结果显示了开源和闭源模型之间的明显差异,特别是在毒性和真实性方面。TriEval作为开源发布,以使计算资源有限的研究人员能够更广泛地访问。

英文摘要

LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety and fairness. Common issues encountered after deploying LLMs include inconsistent outputs and hallucinations of incorrect information. Although numerous LLM evaluation tools exist, most are limited to testing a single parameter at a time or require massive computational resources that are not accessible to most researchers. TriEval addresses these challenges by evaluating LLM outputs across multiple parameters, including bias, toxicity, and truthfulness together, while minimizing computing resources. The pipeline is compatible with both open- and closed-source models and runs on a standard laptop without a GPU cluster. TriEval has been tested on four models: Llama 3 8B, Mistral 7B, Gemma 2 9B, and Claude Haiku. The results show clear differences between open-source and closed-source models, especially in terms of toxicity and truthfulness. TriEval is being released as open source to enable broader access for researchers with limited computational resources.

2606.03034 2026-06-03 cs.MA cs.AI

Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks

能力广告作为柠檬市场:异构智能体网络的信任层

Gaurav Naresh Mittal

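AI summary: Addresses false capability claims in LLM agent networks by proposing a trust layer grounded in market-for-lemons theory that enables trustworthy delegation through probabilistic capability descriptors, screening, and reputation.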

AI中文摘要

大型语言模型(LLM)智能体已开始相互委托工作。诸如模型上下文协议(MCP)和智能体间协议(A2A)等协议允许智能体发布其能力并允许其他智能体调用,且此类智能体的公共注册表已经出现。这些协议假设所广告的能力是静态的、真实的事实。然而,真实的智能体并非如此:其能力是概率性的,随输入变化,在底层模型更新时漂移,并且由于智能体本身是语言模型,它可以完全自信地描述自己却可能是错误的。因此,调用者看到的是智能体声称能做什么,而非实际能做什么,且没有原则性的方法区分可靠提供者和流利的冒名顶替者。我们认为这些困难有一个共同原因:柠檬市场。当质量隐藏且声称成本低廉时,好与坏的提供者变得难以区分,诚实的可靠性得不到回报,市场向最差参与者退化。经济学提供了三种补救措施:信号传递、筛选和声誉,而这些在当今的智能体协议中均不存在。我们做出四项贡献:(1)一个故障分类,将自信-错误命名为非对抗性的、相关的拜占庭故障子类,而经典容错模型对此建模不当;(2)一个柠檬市场模型,表明基于信仰的协议仅允许低信任均衡;(3)信任层,一个轻量级、协议无关的窄腰,位于MCP和A2A之上,添加概率能力描述、筛选和声誉,并在维持过度声称的成本超过其收益时允许分离均衡;(4)一个针对委托链的可靠性组合界限,具有端到端放置论证。该设计无需模型重新训练,并在其信任锚缺失或损坏时优雅降级。

英文摘要

Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Agent2Agent protocol (A2A) let an agent publish what it can do and let others call it, and public registries of such agents are already appearing. These protocols assume an advertised capability is a static, truthful fact. A real agent is none of these things: its competence is probabilistic, varies with input, drifts when the underlying model is updated, and, because the agent is itself a language model, it can describe itself with complete confidence and be wrong. A caller therefore sees what an agent claims to do, not what it can do, with no principled way to tell a reliable provider from a fluent impostor. We argue these difficulties share one cause: the market for lemons. When quality is hidden and claims are cheap, good and bad providers become indistinguishable, honest reliability goes unrewarded, and the market decays toward its worst participants. Economics offers three remedies, signaling, screening, and reputation, and none are present in today's agent protocols. We make four contributions: (1) a failure taxonomy that names confident-wrong as a non-adversarial, correlated subclass of Byzantine faults that classical fault-tolerance mismodels; (2) a market-for-lemons model showing that faith-based protocols admit only a low-trust equilibrium; (3) the Trust Layer, a thin, protocol-agnostic narrow waist above MCP and A2A that adds probabilistic capability descriptors, screening, and reputation, and admits a separating equilibrium when the cost of sustaining an overclaim exceeds the gain from it; and (4) a reliability-composition bound for delegation chains with an end-to-end placement argument. The design needs no model retraining and degrades gracefully when its trust anchors are absent or corrupt.
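
To see why reliability composition matters for delegation chains, consider the following Python sketch. It is not the paper's bound, which the abstract does not state. It uses two standard probability facts instead: under an independence assumption the chain succeeds with the product of the per-hop reliabilities, and without that assumption the union bound and the weakest hop bracket the true value. The function names are hypothetical.

```python
import math


def chain_reliability_independent(reliabilities):
    """Success probability of the whole chain if hops fail independently."""
    return math.prod(reliabilities)


def chain_reliability_bounds(reliabilities):
    """Bounds that hold for any correlation between hop failures.

    Lower bound: union bound on the probability that some hop fails.
    Upper bound: the chain cannot be more reliable than its weakest hop.
    """
    lower = max(0.0, 1.0 - sum(1.0 - r for r in reliabilities))
    upper = min(reliabilities)
    return lower, upper


# Three delegated hops, each advertised as 90% reliable.
hops = [0.9, 0.9, 0.9]
independent = chain_reliability_independent(hops)
lower, upper = chain_reliability_bounds(hops)
```

Even with honest 90% descriptors, three hops leave roughly 73% end-to-end reliability under independence, which suggests why per-agent claims alone say little about a delegation chain.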

2606.03032 2026-06-03 cs.CL

The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation

深思幻觉:诊断多智能体LLM讨论中的事实损耗与立场同质化

Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Ningnan Wang, Nancy F. Chen, Min-Yen Kan

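AI summary: Proposes DelibTrace, a framework that decomposes issues into atomic facts and tracks their survival; finds that multi-agent LLM discussion loses up to 72% of issue-critical facts while stances homogenize, exposing the risk of agreeing more while knowing less.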

AI中文摘要

多智能体LLM系统常将共识视为成功交互的证据。然而,对于深思性问题,可靠性取决于智能体是否保留了理解问题所需的事实和观点。我们识别出深思幻觉:讨论导致(1)事实损耗,即问题关键事实的逐渐丢失,以及(2)立场同质化,即不同立场向共识的崩溃。为了衡量这一过程,我们引入了DelibTrace框架,该框架将每个问题分解为原子事实,标记关键事实,将其分布在智能体之间,并追踪它们在讨论轮次中的存留。在涉及三个代表性LLM家族的伦理和新闻类深思中,多智能体讨论抹去了高达72%的关键事实。这种损失是严重的:保留的证据可能误导性地重构问题,最终立场仍锚定于基础模型先验,且单个恶意智能体可将错误信息注入不断缩小的共享上下文。这些结果揭示了一个更尖锐的风险:智能体可以在知道更少的情况下达成更多一致。我们呼吁进行评估,衡量哪些事实、不确定性和合理分歧在交互中得以存留。

英文摘要

Multi-agent LLM systems often treat consensus as evidence of successful interaction. For deliberative problems, however, reliability depends on whether agents preserve the facts and viewpoints needed to interpret an issue. We identify the deliberative illusion: discussion produces (1) factual attrition, the progressive loss of issue-critical facts, alongside (2) stance homogenization, the collapse of diverse positions toward consensus. To measure this process, we introduce DelibTrace, a framework that decomposes each issue into atomic facts, labels issue-critical ones, distributes them across agents, and tracks their survival across discussion rounds. Across ethical and news-based deliberation with three representative LLM families, multi-agent discussion erases up to 72% of issue-critical facts. This loss is consequential: retained evidence can reconstruct the issue misleadingly, final stances remain anchored in base-model priors, and a single malicious agent can inject misinformation into the shrinking shared context. These results reveal a sharper risk: agents can agree more while knowing less. We call for evaluations that measure which facts, uncertainties, and legitimate disagreements survive interaction.
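
The notion of tracking fact survival across rounds can be made concrete with a small Python sketch. The abstract does not say how DelibTrace decides whether a fact is still present, so the case-insensitive substring match below is a deliberately naive stand-in, and the function name and sample transcripts are hypothetical.

```python
def survival_curve(critical_facts, rounds):
    """Fraction of issue-critical facts still mentioned in each round.

    Presence is judged by a naive case-insensitive substring match,
    a placeholder for whatever matcher a real system would use.
    """
    curve = []
    for transcript in rounds:
        text = transcript.lower()
        kept = sum(1 for fact in critical_facts if fact.lower() in text)
        curve.append(kept / len(critical_facts))
    return curve


facts = ["budget was cut", "two witnesses"]
rounds = [
    "The budget was cut and two witnesses spoke.",
    "We agree the budget was cut.",
    "Everyone agrees on the conclusion.",
]
curve = survival_curve(facts, rounds)
attrition = 1.0 - curve[-1]
```

The final-round attrition is the kind of quantity behind the reported loss of up to 72% of issue-critical facts.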

2606.03031 2026-06-03 cs.AI cs.MA cs.SC

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

AUDITFLOW:用于结构化财务报告验证的可执行符号环境

Yan Wang, Xuguang Ai, Jaisal Patel, Xueqing Peng, Fengran Mo, Yupeng Cao, Haohang Li, Mingyu Cao, Lingfei Qian, Víctor Gutiérrez-Basulto

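AI summary: Proposes AuditFlow, a graph-grounded multi-agent framework that builds a symbolic environment to separate adaptive search from deterministic verification, reaching 82.09% joint audit accuracy on a financial audit verification task.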

AI中文摘要

结构化财务审计验证对语言模型代理而言是困难的,因为正确性依赖于结构化证据而非仅文本。模型必须将报告事实链接到分类概念,遍历计算或维度关系,并在应用审计规则前重新计算预期值。我们提出AuditFlow,一个基于图的多智能体框架,将自适应搜索与确定性验证分离。AuditFlow从静态US-GAAP分类图和动态XBRL申报图构建符号环境,并通过类型化工具暴露事实检索、分类遍历、数值检查和规则评估功能。两名初级审计员从监管和证据角度检查每个案例,而高级审计员解决分歧并可要求进一步调查。最终报告通过证据聚合融合,产生审计结论、预期值、证据链和可信度评分。在基于FinAuditing的FinMR样本上,AuditFlow在GPT-5.5下达到82.09%的联合审计准确率,超过最强基线14.93个百分点。移除确定性检查后准确率降至17.91%,表明符号环境执行了模型无法可靠替代的验证步骤。

英文摘要

Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.
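
The deterministic step described above, recomputing an expected value from a calculation relation before applying a rule, might look like the following Python sketch. It assumes a calculation relation is a list of weighted child concepts, as in XBRL calculation linkbases where arcs carry weights such as +1 or -1. The `CalcArc` type, the tolerance, and the sample figures are hypothetical and are not taken from AuditFlow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalcArc:
    child: str     # concept that contributes to the parent total
    weight: float  # +1.0 adds to the parent, -1.0 subtracts from it


def recompute_parent(facts, arcs):
    """Expected parent value implied by the reported child facts."""
    return sum(arc.weight * facts[arc.child] for arc in arcs)


def check_calculation(facts, parent, arcs, tolerance=0.5):
    """Compare the reported parent fact with the recomputed expected value."""
    expected = recompute_parent(facts, arcs)
    reported = facts[parent]
    return {
        "expected": expected,
        "reported": reported,
        "consistent": abs(expected - reported) <= tolerance,
    }


# Illustrative figures only: gross profit should equal revenue minus cost.
arcs = [CalcArc("Revenues", 1.0), CalcArc("CostOfRevenue", -1.0)]
facts = {"Revenues": 1000.0, "CostOfRevenue": 600.0, "GrossProfit": 450.0}
result = check_calculation(facts, "GrossProfit", arcs)
```

Because the check is plain arithmetic over retrieved facts, it yields both a verdict and the expected value without relying on the language model's own calculation.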

2606.03029 2026-06-03 cs.CL cs.AI

Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

基于研究者指定协变量的条件假设生成用于LLM文本分析

Paiheng Xu, Jing Liu, Wei Ai

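AI summary: Proposes a conditional hypothesis generation framework that incorporates researcher-specified covariates to steer LLMs toward differences that hold within relevant subgroups rather than globally, using feature-covariate interactions to detect sign reversal and within-stratum demeaning with inverse-frequency reweighting to handle stratum imbalance.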

AI中文摘要

计算社会科学的一个核心目标是发现语言如何随感兴趣的结果(如政治派别或教学质量)变化的可解释差异。最近的基于LLM的假设生成方法用自然语言描述这些差异,但选择的是全局判别模式,而没有考虑基于研究者领域知识塑造数据的协变量。当忽略协变量时,所选模式可能反映混杂因素而非实质性感兴趣的差异。我们引入了条件假设生成,这是一个纳入研究者指定协变量的框架,以将假设发现引导至相关子组内成立的差异。出现了两个挑战:目标子组可能代表性不足(分层不平衡),并且差异的方向可能在子组间反转(符号反转)。我们提出了两种受计量经济学启发的方法:一种引入特征-协变量交互以检测符号反转,另一种应用分层内去均值和逆频率重加权以平衡代表性不足的分层。合成实验表明,每种方法在其目标设置中均优于全局基线,对两个真实世界数据集的专家评估证实,协变量感知的生成在相关子组内产生了更有用的假设。

英文摘要

A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature--covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
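
The second method, within-stratum demeaning with inverse-frequency reweighting, corresponds to two standard operations that can be sketched in plain Python. How the paper plugs them into hypothesis selection is not specified in the abstract, so the function names and the particular weight normalization (each stratum receives equal total weight) are assumptions.

```python
from collections import Counter, defaultdict


def demean_within_strata(values, strata):
    """Subtract each stratum's mean so only within-stratum variation remains."""
    counts = Counter(strata)
    sums = defaultdict(float)
    for value, stratum in zip(values, strata):
        sums[stratum] += value
    return [v - sums[s] / counts[s] for v, s in zip(values, strata)]


def inverse_frequency_weights(strata):
    """Per-example weights that give every stratum the same total weight."""
    counts = Counter(strata)
    n_examples, n_strata = len(strata), len(counts)
    return [n_examples / (n_strata * counts[s]) for s in strata]


# Stratum "b" is underrepresented: one example versus three in "a".
strata = ["a", "a", "a", "b"]
values = [1.0, 2.0, 3.0, 10.0]
demeaned = demean_within_strata(values, strata)
weights = inverse_frequency_weights(strata)
```

After demeaning, the large level difference between the strata no longer dominates, and the single example in stratum "b" carries as much total weight as the three examples in "a".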

2606.03028 2026-06-03 cs.SD

Audio Spotforming via Post-Filtering Using Cross-Array Non-target Estimates

通过跨阵列非目标估计的后滤波实现音频点形成

Yuto Ishikawa, Li Li, Shogo Seki, Kouei Yamaoka

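AI summary: Proposes a post-filtering method that uses cross-array non-target estimates in place of low-rank approximation, improving extraction of target speech from noisy mixtures.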

Comments
Accepted for EUSIPCO 2026
AI中文摘要

音频点形成是一种通过利用多个麦克风阵列从噪声混合中提取目标语音的技术。传统方法通过低秩近似从每个阵列获得的线性分离信号中估计共享的目标语音成分,并基于该估计的低秩表示应用后滤波。然而,由于低秩模型与语音信号复杂结构之间的不匹配,直接依赖低秩近似进行后滤波会降低语音提取性能。在本研究中,我们利用一个观察:从一个阵列视角位于目标语音方向上的非目标成分,当从其他阵列观察时可以在空间上分离。这一见解激发了一种新的点形成方法,该方法利用跨阵列的非目标估计而非依赖低秩近似来实现高效的后滤波估计。实验表明,所提方法优于传统的点形成方法。

英文摘要

Audio spotforming is a technique for extracting target speech from noisy mixtures by utilizing multiple microphone arrays. Conventional methods estimate a shared target speech component from linearly separated signals obtained by each array using low-rank approximations and apply post-filtering (PF) based on this estimated low-rank representation. However, owing to the mismatch between low-rank models and the complex structure of speech signals, directly relying on low-rank approximations for PF can degrade the speech extraction performance. In this study, we leverage the observation that non-target components located in the target speech direction from the perspective of one array can be spatially separated when viewed from other arrays. This insight motivates a new spotforming method for efficient post-filter estimation using non-target estimates across arrays instead of relying on low-rank approximations. Experiments demonstrate that the proposed method outperforms conventional spotforming methods.

2606.03027 2026-06-03 cs.CL

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

SEA-Embedding:面向东南亚的开放且可复现的文本嵌入

Peerat Limkonchotiwat, Raymond Ng, Sarana Nutanong, Jian Gang Ngui

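AI summary: Proposes SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages; by studying three core factors (data composition, training objective, and base encoder initialization), it achieves state-of-the-art results on SEA-BED.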

AI中文摘要

文本嵌入对许多下游应用至关重要,因此鲁棒性对于现实世界的NLP很重要。然而,最近最先进的嵌入模型大多不可复现,因为它们依赖于封闭或未公开的训练数据,并且对于东南亚语言来说仍然不够鲁棒。我们提出了SEA-Embedding,一个完全开放且可复现的东南亚语言文本嵌入管道,仅使用公开可用的数据进行训练,并用它来研究鲁棒嵌入设计的三个核心因素:数据组成、训练目标和基础编码器初始化。SEA-Embedding在SEA-BED上取得了最先进的结果,同时能够对该地区的鲁棒文本嵌入进行系统且可复现的分析。

英文摘要

Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.

2606.03026 2026-06-03 cs.NE cs.AI cs.LG

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

面向稀疏脉冲语言模型在商用CPU上的脉冲感知C++ INT8推理

Ting Liu

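AI summary: Proposes a spike-aware C++ inference runtime that treats sparse binary spike states as an execution primitive and combines a mixed memory layout, AVX2/FMA kernels, and INT8 quantization to decode spiking language models efficiently on commodity CPUs; throughput exceeds that of similarly sized dense models, at some cost in quality.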

Comments
11 pages, 7 tables
AI中文摘要

脉冲语言模型展现出激活稀疏性,而稠密Transformer运行时无法直接利用。本文从系统角度研究这一特性。基于SymbolicLight V1脉冲门控语言模型家族,我们实现了一个C++ CPU推理运行时,将稀疏二进制脉冲状态视为执行原语,而非仅应用事后权重压缩。该运行时结合了清单驱动的权重加载器、混合行/列内存布局、AVX2/FMA内核、每通道对称INT8量化以及脉冲条件稀疏路径的整数域累加。在AMD Ryzen 7 5800X上,早期标量FP32基线解码速度为9.5 tokens/s。混合布局AVX2 FP32将其提升至14.7 tokens/s,而AVX2 INT8在相同step-30k导出模型上达到19.9 tokens/s,同时将权重占用从3.49 GB降至1.06 GB。对于可用的186k步874M参数INT8导出模型,C++运行时在单线程CPU基准测试中解码速度为22.63 tokens/s,相比之下,TinyLlama-1.1B Q8_0为16.31 tokens/s,Falcon3-1B Q8_0为11.26 tokens/s,Qwen2.5-1.5B Q8_0为9.70 tokens/s。线程扩展在四个CPU线程时达到47.90 tokens/s,512 token预填充从单线程的29.86 tokens/s提升至八线程的94.68 tokens/s。吞吐量提升伴随着质量代价:SNN报告WikiText-2困惑度为24.80,差于同一基准中的稠密基线。我们将结果定位为稀疏语言运行时的推理系统研究,长期动机在于可能受益于传感器和执行器附近本地低核推理的具身和边缘智能体。脉冲感知执行可以改善稀疏脉冲语言模型的CPU吞吐量和内存行为,而模型质量、受控稠密训练基线、具身任务评估和测量CPU能耗仍是开放问题。

英文摘要

Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than only applying post-hoc weight compression. The runtime combines a manifest-driven weight loader, mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths. On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decodes at 9.5 tokens/s. Mixed-layout AVX2 FP32 raises this to 14.7 tokens/s, and AVX2 INT8 reaches 19.9 tokens/s on the same step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB. For the available 186k-step 874M-parameter INT8 export, the C++ runtime decodes at 22.63 tokens/s in a single-thread CPU benchmark, compared with 16.31 tokens/s for TinyLlama-1.1B Q8_0, 11.26 tokens/s for Falcon3-1B Q8_0, and 9.70 tokens/s for Qwen2.5-1.5B Q8_0 under llama.cpp. Thread scaling reaches 47.90 tokens/s at four CPU threads, and 512-token prefill improves from 29.86 to 94.68 tokens/s from one to eight threads. The throughput result comes with a quality cost: the SNN reports WikiText-2 perplexity 24.80, worse than the dense baselines in the same benchmark. We frame the result as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models, while model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems.
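
Two ingredients named above, per-channel symmetric INT8 quantization and integer accumulation over spike-conditioned sparse paths, can be illustrated in plain Python. This is only a conceptual sketch: the actual runtime is written in C++ with AVX2/FMA kernels, and the function names and toy weights here are hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def spike_matvec(quantized, scales, spikes):
    """Compute W @ s for a binary spike vector s.

    Only columns where a spike fired are visited, and the per-row sum stays
    in the integer domain until one final multiplication by the row scale.
    """
    active = [j for j, fired in enumerate(spikes) if fired]
    outputs = []
    for row, scale in zip(quantized, scales):
        acc = sum(row[j] for j in active)  # integer adds, no multiplies
        outputs.append(acc * scale)
    return outputs


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
spikes = [1, 0, 1]  # inputs 0 and 2 fired, input 1 is silent
q_weights, q_scales = quantize_per_channel(weights)
y = spike_matvec(q_weights, q_scales, spikes)
```

Because the spike vector is binary, the inner product collapses to summing the selected integer weights, which is the property a spike-aware kernel can exploit and a dense runtime cannot.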

2606.03022 2026-06-03 cs.CL cs.AI

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

幻觉作为正交噪声:通过动态上下文正交化实现推理时流形对齐

Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

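AI summary: Proposes a geometric framework based on the linear representation hypothesis that interprets LLM hallucinations as noise orthogonal to the semantic manifold of the residual stream, and introduces Dynamic Contextual Orthogonalization (DCO), an inference-time intervention whose layer-wise Z-score suppression selectively attenuates outlier orthogonal components, improving contextual faithfulness while preserving parametric knowledge.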

AI中文摘要

大语言模型(LLMs)中的幻觉——即生成与上下文事实或逻辑约束不一致的内容——仍然是可靠部署面临的持续挑战。在这项工作中,我们通过基于线性表示假设的几何框架来解决这个问题。我们提出,幻觉表现为相对于残差流语义流形的正交噪声。具体来说,我们假设虽然注意力头理想地传播与上下文子空间一致的信息,但当特定头引入与该子空间正交的分量时,就会产生幻觉,破坏潜在表示的一致性。基于这一表述,我们引入了动态上下文正交化(DCO),一种推理时干预方法。DCO利用输入残差流作为动态上下文锚点,对注意力头输出进行正交分解。为了区分上下文对齐的语义更新和发散噪声,DCO采用层间Z分数抑制机制,根据统计分布选择性地衰减异常正交分量。在XSum、NQ-Swap和IFEval等基准上对Llama-3-8B和70B的评估表明,与最先进的干预基线相比,DCO实现了更优的上下文忠实度。此外,DCO在TriviaQA和TruthfulQA等知识密集型任务上保持高性能,有效缓解了现有方法中常见的幻觉抑制与参数知识保留之间的权衡。我们的发现验证了幻觉的几何解释,并将DCO确立为一种计算高效的流形对齐方法。代码可在 https://github.com/Harry-Miral/DCO 获取。

英文摘要

Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO.
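
The decomposition and suppression steps described above can be sketched with plain Python vectors. This is a toy reading of the abstract, not the released implementation: it treats the anchor as a single direction rather than a subspace, computes Z-scores over the orthogonal norms of the heads in one layer, and uses hypothetical `z_threshold` and `damping` parameters.

```python
import math
import statistics


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def split_by_anchor(head_out, anchor):
    """Split a head output into parts parallel and orthogonal to the anchor."""
    coef = dot(head_out, anchor) / dot(anchor, anchor)
    parallel = [coef * a for a in anchor]
    orthogonal = [h - p for h, p in zip(head_out, parallel)]
    return parallel, orthogonal


def suppress_outlier_heads(head_outputs, anchor, z_threshold=1.0, damping=0.1):
    """Attenuate orthogonal components whose norm is an outlier in the layer."""
    parts = [split_by_anchor(h, anchor) for h in head_outputs]
    norms = [math.sqrt(dot(orth, orth)) for _, orth in parts]
    mean, std = statistics.fmean(norms), statistics.pstdev(norms)
    adjusted = []
    for (parallel, orth), norm in zip(parts, norms):
        z = (norm - mean) / std if std > 0 else 0.0
        scale = damping if z > z_threshold else 1.0
        adjusted.append([p + scale * o for p, o in zip(parallel, orth)])
    return adjusted


anchor = [1.0, 0.0]  # stand-in for the input residual stream direction
heads = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.1], [1.0, 3.0]]
adjusted = suppress_outlier_heads(heads, anchor)
```

Only the last head has an unusually large orthogonal component, so only its off-anchor part is damped while the context-aligned part of every head is left intact.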

2606.03021 2026-06-03 cs.CL

Hint-Guided Diversified Policy Optimization for LLM Reasoning

提示引导的多样化策略优化用于大语言模型推理

Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu

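AI summary: Proposes Hint-Guided Diversified Policy Optimization (HDPO), which uses a "propose-select-think" trajectory to incentivize diverse and reliable solutions, improving reasoning ability, candidate-solution diversity, and the ability to identify reliable solutions.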

AI中文摘要

大语言模型(LLMs)的最新发展展示了令人印象深刻的推理能力,其中可验证奖励的强化学习(RLVR)是一种有前景的增强策略。然而,现有的奖励机制局限于结果层面的正确性,缺乏明确的信号来引导模型考虑多样化的解决方案。相比之下,人类问题解决通常涉及评估多种潜在方法并选择最可靠的解决方案,而当前的RLVR框架并未明确激励这种认知过程。受此启发,我们提出了提示引导的多样化策略优化(HDPO),允许模型首先列出所有潜在的候选解决方案大纲作为提示,然后选择最可靠的一个进行进一步推理。HDPO包括两个阶段:结构化推理的冷启动和提示引导的多样化强化学习,以激励模型遵循“提出-选择-思考”轨迹生成多样且可靠的解决方案。实验结果表明,HDPO有效提升了LLM的推理能力,增强了候选解决方案的多样性以及LLM识别可靠解决方案的能力。

英文摘要

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are limited to outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.