arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07589 2026-06-09 cs.LG New submission

Optimality of Sequential Filtering Under Independent Cost and Selectivity Models

Hrishikesh Paranjape, Abhishek Mandal, Xian Sun

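Affiliations: IEEE International Conference on Electro/Information Technology (EIT 2026)

AI summary: For sequential filtering pipelines, the paper proves that, under an independence model, ordering filters by increasing ratio of cost to rejection probability minimizes expected total cost, and uses Monte Carlo simulation to confirm that this ordering outperforms common heuristics.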

Comments 2 pages, 2 figures. Accepted at the 2026 IEEE International Conference on Electro/Information Technology (EIT 2026)

Abstract

Sequential filtering pipelines are a common design pattern in large-scale systems, where a large population of items is progressively reduced by a sequence of stages that each incur cost. Despite their prevalence in ranking systems, cascaded machine learning inference, and fraud detection, filter ordering is often determined by heuristics without formal guarantees. We formalize sequential filtering under an expected-cost objective and prove that, under an independence model, ordering filters by increasing ratio of cost to rejection probability minimizes expected total cost. Extensive Monte Carlo simulations show that the optimal ordering strictly dominates common heuristics across all runs, both in expectation and across the full distribution of outcomes.
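The ordering rule in this abstract can be checked on a toy case. The sketch below assumes the usual expected-cost model for independent filters (an item pays for a filter only if every earlier filter passed it) and compares the cost-to-rejection-probability ordering against exhaustive search. The filter values are hypothetical, and the rule needs every rejection probability to be strictly positive.

```python
from itertools import permutations


def expected_cost(filters):
    """Expected per-item cost of running filters in the given order.

    Each filter is a (cost, reject_prob) pair. Under independence, an item
    reaches a filter with probability prod(1 - reject_prob) over all
    earlier filters, and pays that filter's cost only if it reaches it.
    """
    total, survive = 0.0, 1.0
    for cost, reject_prob in filters:
        total += survive * cost
        survive *= 1.0 - reject_prob
    return total


def ratio_order(filters):
    """Sort by cost / rejection probability, smallest ratio first."""
    return sorted(filters, key=lambda f: f[0] / f[1])


# Hypothetical filters: (cost per item, rejection probability).
filters = [(5.0, 0.9), (1.0, 0.1), (2.0, 0.5), (4.0, 0.3)]

cost_by_ratio = expected_cost(ratio_order(filters))
cost_by_search = min(expected_cost(p) for p in permutations(filters))
cost_cheapest_first = expected_cost(sorted(filters))  # a common heuristic

print(cost_by_ratio, cost_by_search, cost_cheapest_first)
```

Exhaustive search is only feasible for a handful of filters; the point of the ratio rule is that a single sort gives the same answer.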

2606.07587 2026-06-09 cs.LG New submission

The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers

Yifan Lu, Qiyue Zhang, Shenrun Zhang, Zhibo Yu, Zhuang Wang, Hanjie Chen, Jiarong Xing

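Affiliations: Rice University; Amazon

AI summary: A study of many LLM routing methods finds a "routing plateau": accuracy converges to a narrow range far below the oracle router, mainly because of a predictability bottleneck. Larger training data, stronger encoders, and end-to-end fine-tuning can push past the plateau.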

Comments 23 Pages, 12 Tables, 9 Figures

Abstract

LLM routing has become a popular approach to improve the cost-quality trade-off of LLM services by dynamically selecting a model for each query. Recent work has explored a broad range of routing methods, including clustering-based routers, learned classifiers, pairwise ranking, and confidence-based approaches. Our extensive study of 21 routing methods across five benchmarks reveals a consistent phenomenon that we call the routing plateau: many methods, including kNN, achieve very similar accuracy and converge to a narrow performance range that remains far below the oracle router. Our investigation shows that the plateau is largely caused by a predictability bottleneck: current routers mainly learn global averaged model-performance trends rather than fine-grained query-specific routing signals. As a result, they solve overlapping easy queries but collectively fail on hard queries that require instance-specific routing decisions. We further study how to move beyond the plateau and find that larger training datasets, stronger encoders, and end-to-end fine-tuning can further improve routing accuracy. These findings characterize the common limits of current routing methods and provide insights and actionable directions for the community to build more effective routing systems.
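To make the gap between a router that follows global averages and the oracle router concrete, here is a small sketch. It assumes a binary correctness matrix (rows are queries, columns are candidate models); the matrix and function names are hypothetical illustrations, not data or code from the paper.

```python
def best_single_accuracy(correct):
    """Accuracy of always routing to the model that is right most often."""
    n_models = len(correct[0])
    best_hits = max(sum(row[m] for row in correct) for m in range(n_models))
    return best_hits / len(correct)


def oracle_accuracy(correct):
    """Accuracy of an oracle router: right whenever any model is right."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Hypothetical outcomes: correct[q][m] is 1 if model m answers query q.
correct = [
    [1, 1, 1],  # easy query: every model succeeds
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],  # hard queries: only one specific model succeeds
    [0, 0, 1],
    [0, 0, 0],  # unsolved by every model
]

single = best_single_accuracy(correct)
oracle = oracle_accuracy(correct)
print(f"best single model: {single:.3f}, oracle router: {oracle:.3f}")
```

A router that only learns which model is best on average cannot exceed the first number; closing the distance to the second requires per-query signals.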

2606.07585 2026-06-09 cs.CV cs.AI New submission

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

Anderson Augusma

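Affiliations: Université Grenoble Alpes; Univ. of Glasgow; Inria; Univ. Paris-Saclay; TU Delft

AI summary: The thesis proposes two multimodal frameworks (cross-attention fusion with Frames Attention Pooling, and a Variational Encoder Multi-Decoder) that recognize group emotion from collective audio-video signals without individual-level features, achieving robust performance while preserving privacy.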

Comments Doctoral thesis

Abstract

This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed. The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues. Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.

2606.07583 2026-06-09 cs.LG cs.AI New submission

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

Lihui Liu, Mucun Sun, Caisheng Wang

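Affiliations: Wayne State University; University of Texas at Dallas

AI summary: Proposes a spectral graph reinforcement learning framework that uses a spectral graph neural network to learn the optimal restoration policy, giving near-optimal real-time outage management in distribution networks; generalization is evaluated on three IEEE test systems.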

Abstract

Self-healing smart grids can quickly adjust their network configuration during outages to minimize power disruptions. During an outage, several actions can be taken, such as network reconfiguration through switching operations and emergency load shedding. However, traditional machine learning methods for outage mitigation are not well suited for smart grids due to their slow response time and high computational cost. To address these challenges, recent studies have explored reinforcement learning to automatically perform network reconfiguration. In these approaches, the control policy is typically modeled using a graph neural network (GNN). However, conventional GNNs operate in the spatial domain and may fail to capture important relationships in the frequency domain. Frequency-domain information is particularly useful for modeling global structural patterns and system-wide interactions in power networks. In this paper, we propose a spectral graph reinforcement learning framework for outage management in distribution networks to enhance system resilience. Our model learns the optimal power restoration policy using a spectral graph neural network. We evaluate the proposed method on three modified IEEE test systems: the 13-bus, 34-bus, and 123-bus networks. Experimental results show that our approach achieves near-optimal performance in real time and generalizes well across a wide range of outage scenarios.

2606.07582 2026-06-09 cs.LG cs.AI cs.ET New submission

Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles

Joyjit Roy, Samaresh Kumar Singh, Laxmi Shaw

Affiliations: Independent Researcher, Austin, TX, USA; Independent Researcher, Leander, TX; Texas A&M University-Victoria, Victoria, TX

AI summary: Proposes a hybrid architecture that combines FT-Transformer with XGBoost through calibration-aware stacking to handle class imbalance and feature interactions, reaching 62.10% F1 and 0.861 AUC-ROC on a bank churn dataset.

Comments 22 pages, 9 figures, 20 tables; published in IEEE Access

Journal ref IEEE Access, vol. 14, pp. 62834-62855, 2026

Abstract

Customer churn prediction is essential across data-driven industries such as insurance, digital banking, eCommerce, and subscription platforms, where retaining existing customers is typically more cost-effective than acquiring new ones. Predicting churn on structured datasets remains challenging due to class imbalance, nonlinear feature interactions, and heterogeneous feature types. Tree-based ensemble methods consistently demonstrate strong performance in these contexts, often outperforming conventional neural networks. This study introduces a validated hybrid architecture that integrates feature-tokenized transformers (FT-Transformer) with gradient-boosted trees through calibration-aware stacking. The proposed framework addresses persistent gaps in statistical validation, probability calibration, and reproducibility found in prior research. The FT-Transformer captures higher-order feature interactions using self-attention, while XGBoost captures gradient-boosted decision boundaries with complementary inductive biases. Class imbalance is handled using class-weighted loss functions, thereby avoiding synthetic oversampling and preserving minority-class distributions. The models are ensembled using out-of-fold (OOF) stacking with a logistic regression meta-learner, which recalibrates overconfident base model outputs and learns optimal combination weights. On a public bank churn dataset, the hybrid model achieves 62.10% F1, 0.861 AUC-ROC, and 0.647 PR-AUC, outperforming the Multi-Layer Perceptron (MLP) baseline by 3.37 F1 points and 0.027 AUC under 5x5 cross-validation with 95% confidence intervals reported. Ablation studies demonstrate that both the transformer component and stacking strategy contribute materially to performance. The proposed methodology offers a reproducible and extensible reference architecture for contemporary churn prediction on structured tabular data.

2606.07581 2026-06-09 cs.LG cs.AI cs.ET New submission

Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment

Bruce Changlong Xu, Lan Wu

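Affiliations: University of California, Berkeley

AI summary: Proposes the kernel-contract framework, which uses numerical, statistical, runtime, and observability clauses to constrain distributional divergence between the training kernel and the inference kernel, and derives drift bounds that limit policy-gradient bias.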

Abstract

A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.
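The chain from logit drift to total-variation distance to reward drift can be illustrated with textbook inequalities. The sketch below is not the paper's derivation: it assumes a max-norm logit drift of at most eps and rewards bounded by r_max in absolute value, and all logits and rewards are made-up values. Under those assumptions each softmax probability changes by a factor in [exp(-2 eps), exp(2 eps)], so TV <= (exp(2 eps) - 1) / 2, and the expected-reward gap is at most 2 * r_max * TV.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_bound_from_logit_drift(eps):
    """Upper bound on TV when every logit moves by at most eps."""
    return min(1.0, 0.5 * (math.exp(2.0 * eps) - 1.0))


def reward_drift_bound(tv, r_max):
    """Upper bound on |E_p[r] - E_q[r]| when |r| <= r_max."""
    return 2.0 * r_max * tv


# Hypothetical logits from a training kernel and an inference kernel
# evaluated at identical weights.
train_logits = [2.0, 0.5, -1.0, 0.0]
infer_logits = [2.03, 0.46, -1.02, 0.05]
rewards = [1.0, -1.0, 0.5, 0.0]
r_max = max(abs(r) for r in rewards)

eps = max(abs(a - b) for a, b in zip(train_logits, infer_logits))
p, q = softmax(train_logits), softmax(infer_logits)

observed_tv = total_variation(p, q)
tv_bound = tv_bound_from_logit_drift(eps)
observed_gap = abs(sum(r * (a - b) for r, a, b in zip(rewards, p, q)))
reward_bound = reward_drift_bound(tv_bound, r_max)

print(observed_tv, tv_bound, observed_gap, reward_bound)
```

These bounds are loose by design; a contract clause in this spirit would compare a measured drift against a budget rather than expect the bound to be tight.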

2606.07578 2026-06-09 cs.LG stat.ME stat.ML New submission

MST-Direct at Scale: Multivariate and Conditional Geostatistical Simulation via Sinkhorn Optimal Transport

Tcharlies Bachmann Schmitz

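Affiliations: GitHub arXiv

AI summary: Extends MST-Direct with a sparse Sinkhorn matcher, multivariate tuple matching, and kriging-based conditioning, enabling large-scale, multivariate, conditional geostatistical simulation that preserves the joint distribution exactly.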

Abstract

This paper extends MST-Direct, a Matching-via-Sinkhorn-Transport approach for multivariate geostatistical simulation, from the original bivariate, unconditional, small-grid formulation to multivariate, conditional, and large-grid settings. We address the three main limitations identified in the original work: (i) scalability beyond a few thousand nodes through a sparse, candidate-restricted Sinkhorn matcher with O(nC) memory complexity; (ii) extension to multiple variables by matching target value tuples onto an independent FFT-MA Gaussian backbone that reproduces a prescribed variogram; and (iii) hard-data conditioning by fixing observed data tuples at their spatial locations while conditioning the backbone through kriging. Because the transport plan remains a permutation of the target tuples, the multivariate joint distribution is preserved exactly. The method is validated using the same six-variate, heteroscedastic, strongly nonlinear reference distribution employed in Direct Multivariate Simulation (DMS), under both unconditional (200x200) and conditional (100x100, 200 hard-data samples) scenarios, and is benchmarked against the Projection Pursuit Multivariate Transform (PPMT). Results show that MST-Direct reproduces the joint distribution with zero histogram error, exactly honours hard data, and accurately reproduces the prescribed spatial correlation structure, whereas PPMT remains an approximation.

Index Terms: Optimal transport, Sinkhorn algorithm, geostatistical simulation, multivariate simulation.
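The statement that a permutation of the target tuples preserves the distribution exactly can be seen in a tiny dense example. The sketch below runs plain Sinkhorn iterations on a 3x3 squared-distance cost and then rounds the plan to a one-to-one matching. It is not the paper's sparse, candidate-restricted matcher: the greedy rounding step, the scalar values, and the regularization strength are all assumptions made for illustration.

```python
import math


def sinkhorn_plan(cost, epsilon=0.05, iterations=500):
    """Entropic optimal transport plan between two uniform marginals."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def round_to_permutation(plan):
    """Greedy rounding (an assumption): take the largest entries first."""
    n = len(plan)
    pairs = sorted(((plan[i][j], i, j) for i in range(n) for j in range(n)),
                   reverse=True)
    match, used = [None] * n, set()
    for _, i, j in pairs:
        if match[i] is None and j not in used:
            match[i] = j
            used.add(j)
    return match


# Hypothetical backbone values (one per grid node) and target values.
backbone = [0.1, 0.9, 0.5]
targets = [0.8, 0.2, 0.45]
cost = [[(x - y) ** 2 for y in targets] for x in backbone]

plan = sinkhorn_plan(cost)
match = round_to_permutation(plan)
simulated = [targets[j] for j in match]
print(match, simulated)
```

Because `simulated` is only a reordering of `targets`, its histogram is identical to the target histogram regardless of how good the matching is; the matching affects only where each value lands in space.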

2606.07577 2026-06-09 cs.AI cs.CV cs.SD eess.AS New submission

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

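Affiliations: Tsinghua University; ByteDance; Department of Engineering, University of Cambridge

AI summary: Proposes OmniMem, a streaming memory-compression framework for audio-visual LLMs that compresses the KV cache through modality-aware allocation and perturbation-aware selection, cutting memory while preserving long-video understanding and improving absolute accuracy by 2-4% over training-free compression baselines on several benchmarks.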

Comments Code: https://github.com/bytedance/SALMONN/tree/omni_mem

Abstract

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.

2606.07576 2026-06-09 cs.LG cs.ET cs.MA New submission

When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery

Neel Tushar Shah, Manglam Kartik

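Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes CARTOGRAPH, a verification layer that combines unresolved-subspace steering, ambiguity closure, and residual-based library inadequacy detection; it outperforms raw projection on several testbeds and can identify and then revoke out-of-library mechanisms.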

Comments Accepted at AI for Science Workshop at ICML 2026

Abstract

We present CARTOGRAPH, a verification layer for AI scientists that couples unresolved-subspace experiment steering (select), explicit ambiguity closure (resolve), and residual-based library inadequacy detection (refuse). Under a local linear-Gaussian bridge, raw unresolved projection is the isotropic unresolved Fisher-information trace, while CARTOGRAPH-A is the exact unresolved A-optimal rule; closed-form EIG and Box-Hill arise as local comparators rather than global equivalents. Across five testbeds, CARTOGRAPH-A beats raw projection 129W/0T/15L at d = 8 (p < 10^-21) in a replicated structured cascade. More distinctively, the framework tentatively identifies three out-of-library pharmacokinetic mechanisms and then revokes those identifications as residuals expose structural misfit, while one perturbed in-library control stays identified throughout. In low-dimensional pharmacokinetic and filtered EPA settings, near-ties against disagreement are predicted by theory and observed. Finally, in a retrospective audit of 40 positive claims from the published A-Lab autonomous materials system, the refuse guard flags all 4 claims later marked inconclusive under manual reanalysis while passing 32/36 confirmed claims. Code is available at https://github.com/ai4science-boed/cartograph.git

2606.07571 2026-06-09 cs.LG cs.AI New submission

Enabling KV Caching of Shared Prefix for Diffusion Language Models

Younghun Go, Jaehoon Han, Changyong Shin, Chuk Yoo, Gyeongsik Yang

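Affiliations: Korea University

AI summary: Because bidirectional attention in diffusion language models makes shared-prefix KVs unstable, the paper proposes bidirectional prefix caching (bicache), which dynamically identifies a safe layer depth for reusing KVs, avoiding accuracy collapse while raising throughput by 36.3%-98.3%.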

Abstract

Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero. To unlock high-throughput DLM serving, we propose bidirectional prefix caching, bicache, the first KV caching technique for shared prefixes in DLMs. bicache is designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request. Thus, bicache dynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation. Evaluations demonstrate that bicache significantly improves serving throughput by 36.3%-98.3% compared to existing techniques without accuracy collapse (only 0-1.8% difference).

2606.07569 2026-06-09 cs.LG New submission

TriHead-GAN: A Generative Adversarial Network with Triple-Head Discriminator for Carbon Emission Time Series Generation

Zesen Wang, Lijuan Lan, Yonggang Li, Chunhua Yang

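Affiliations: SanMuGuo

AI summary: To address the scarcity of city-level high-frequency carbon emission data, proposes TriHead-GAN, whose triple-head discriminator jointly supervises distributional authenticity, cross-variable dependency, and step-wise smoothness; it outperforms mainstream baselines on multiple datasets and improves downstream forecasting accuracy.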

Abstract

Accurate carbon emission monitoring is critical for climate policy and emerging regulatory mechanisms such as the EU Carbon Border Adjustment Mechanism, yet city-level high-frequency monitoring data remain extremely scarce, severely limiting data-hungry deep learning models. Time series generation is a natural remedy, but existing GAN and diffusion-based generators often provide limited explicit supervision for the domain structure of carbon emission data: they may match marginal distributional statistics while insufficiently preserving cross-variable correlations between CO$_2$ and co-emitted pollutants and meteorological factors, and tend to collapse the first-difference statistics of atmospheric measurements, producing sequences that are smooth on average but lack the realistic step-wise variability of the underlying signals. We propose TriHead-GAN, a Transformer-based adversarial framework whose triple-head discriminator jointly supervises three complementary aspects of the joint distribution: distributional authenticity via a Wasserstein critic, cross-variable dependency via leakage-free regression of the target variable, and step-wise temporal smoothness via adjacent-difference prediction. The generator combines global self-attention with local temporal convolution, per-step noise injection, and an anti-smoothing loss that matches first-difference statistics. Experiments on the self-collected Changsha Carbon dataset, two public carbon datasets (China, US), and the ETTh1 benchmark show that TriHead-GAN achieves favorable performance over mainstream baselines on the vast majority of settings, and that the resulting synthetic windows improve downstream forecasting accuracy in low-resource carbon monitoring scenarios.
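The abstract does not give the form of the anti-smoothing loss, so the following is only one plausible reading: penalize the gap between the mean and standard deviation of first differences in real and generated series. The function names and the sample readings are hypothetical.

```python
import math


def first_differences(series):
    """Step-to-step changes x[t+1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def mean_and_std(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def anti_smoothing_loss(real, generated):
    """Squared gap between first-difference mean and std (an assumed form)."""
    real_mean, real_std = mean_and_std(first_differences(real))
    gen_mean, gen_std = mean_and_std(first_differences(generated))
    return (real_mean - gen_mean) ** 2 + (real_std - gen_std) ** 2


# Hypothetical CO2 readings with realistic step-wise variability.
real = [400.0, 412.0, 405.0, 418.0, 409.0, 421.0, 414.0]
# An over-smoothed sample: same overall trend, no step-to-step variation.
slope = (real[-1] - real[0]) / (len(real) - 1)
smooth = [real[0] + slope * t for t in range(len(real))]

loss_same = anti_smoothing_loss(real, real)
loss_smooth = anti_smoothing_loss(real, smooth)
print(loss_same, loss_smooth)
```

The smooth ramp matches the real series on average trend, yet the loss is large because its step-wise variability has collapsed, which is the failure mode the abstract describes.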

2606.07565 2026-06-09 cs.LG New submission

STARIXNet: Multivariate and Multi-attribute Deep Learning Approach to Real-Time Resource Allocation in Cloud Platforms

Ahmed Abdulaal, Maruf Aytekin, Thilaga kumaran Srinivasan, Tomer Lancewicki

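Affiliations: Walmart Global Tech

AI summary: Proposes STARIXNet, a lightweight neural network that allocates resources in the multivariate space by capturing spatio-temporal relationships among multiple system metrics, prioritizing service stability ahead of cost efficiency; in production at Walmart it delivers savings of 10% to 50%.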

Comments 11 pages, 12 figures. Under review

Abstract

Intelligent scaling of microservices in cloud platforms is crucial for mitigating escalating compute costs while avoiding service disruptions. Current solutions are limited to the univariate space, typically focusing on CPU usage alone to drive scaling decisions. Moreover, they address the problem as a purely forecasting task, focusing on prediction precision while neglecting the greater risks of underestimation and delays in system responsiveness. Alternative solutions are computationally complex, making them impractical for large-scale, real-time deployments. To address these challenges, we present STARIXNet, a lightweight neural network that guides resource allocation decisions in the multivariate space by capturing spatio-temporal relationships among multiple system metrics. STARIXNet models multiple quasi-dependent attributes, in particular the (S)easonal, (T)emporal, (A)uto-(R)egressive (I)ntegrated, and e(X)ogenous patterns, then implements an aggregation policy to finalize scaling decisions, prioritizing service stability, followed by cost-efficiency, over raw forecast accuracy. We empirically demonstrate the performance of STARIXNet by benchmarking against existing solutions in real-world settings. STARIXNet is deployed for critical production microservices at Walmart, achieving tangible savings ranging from 10% to 50%, in addition to intangible benefits through improved service stability and customer experience.

2606.07563 2026-06-09 cs.LG cs.AI New submission

Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems

Truong Xuan Khanh

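Affiliations: H&K Research Studio; Clevix LLC

AI summary: Proposes the Hierarchical Emergence Framework (HEF), which models emergence as a phase transition in a mechanism landscape, proves physical feasibility and convergence to a unique fixed point under structural assumptions, and tests for a phase-transition fingerprint in 111 modular-arithmetic transformer experiments.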

Comments 27 pages, 3 figures, 2 tables; 15-page Supplementary Information with complete proofs included

Abstract

Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary lineages rediscover similar metabolic solutions, and renormalization flows approach common fixed points. We propose the Hierarchical Emergence Framework (HEF) as a candidate universality framework for such convergence phenomena. HEF models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. The framework introduces a critical energy threshold Ec separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions. We further connect this convergence structure to causal emergence through Effective Information and mechanism competition entropy. To test the framework, we study delayed generalization ("grokking") in modular arithmetic transformers across 111 experiments. We identify a reproducible empirical fingerprint of the Ec transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink (R^2=0.93) consistent with a Landau-Ginzburg universality class, and all grokked models converge to 0.9745+/-0.014 regardless of initialization, weight decay, or training fraction (ANOVA p>0.13). HEF is not presented as a universal theory of emergence, but as a falsifiable mathematical scaffold for studying convergence phenomena across complex systems.

2606.07561 2026-06-09 cs.LG stat.ME stat.ML New submission

Boundary Variance Inflation Causes Acquisition Bias in Gaussian Processes

Maria Bånkestad, Sanna Jarl, Jens Sjölund

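Affiliations: RISE Research Institutes of Sweden; Uppsala University

AI summary: Shows that the root cause of boundary variance inflation in stationary-kernel Gaussian processes on bounded domains is truncation of the kernel correlation neighborhood, demonstrates that this geometric distortion systematically biases three classes of acquisition functions, and proposes a function-free selection-profile diagnostic.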

Comments 14 pages, 8 figures; appendices included

Abstract

Gaussian processes with stationary kernels on bounded domains exhibit inflated posterior variance near the boundary. Despite being a long-recognized artifact in geostatistics and a source of over-exploration in Bayesian optimization, the causes and effects of boundary-induced acquisition bias are underexplored. We trace the root cause to a simple geometric mechanism: the truncation of the kernel correlation neighborhood at the domain boundary creates an observation-independent distortion that worsens with dimensionality. We show how this distortion manifests across three acquisition classes: variance maximization concentrates selections at the corners, whereas negative integrated posterior variance and expected predictive information gain move selections inward to axis-aligned interior shells. These patterns arise without reference to any objective function, meaning that acquisition behavior can be dominated by kernel geometry rather than the desired task-specific uncertainty. To quantify this, we introduce a function-free selection-profile diagnostic for arbitrary acquisitions, kernels, and bounded-domain geometries.
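The boundary effect can be reproduced in one dimension without any objective function, since a Gaussian process's posterior variance depends only on input locations. The sketch below is an illustrative setup, not the paper's diagnostic: it assumes an RBF kernel with lengthscale 0.2, noise variance 0.01, and ten evenly spaced training inputs on [0, 1], and it uses a small hand-written linear solver to stay within the standard library.

```python
import math


def rbf(x1, x2, lengthscale=0.2):
    """Stationary squared-exponential kernel with unit variance."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def posterior_variance(query, train_x, noise=1e-2):
    """Latent posterior variance k(q,q) - k_q^T (K + noise I)^-1 k_q."""
    k_train = [[rbf(a, b) + (noise if i == j else 0.0)
                for j, b in enumerate(train_x)]
               for i, a in enumerate(train_x)]
    k_query = [rbf(query, a) for a in train_x]
    weights = solve(k_train, k_query)
    return rbf(query, query) - sum(k * w for k, w in zip(k_query, weights))


train_x = [(i + 0.5) / 10 for i in range(10)]  # evenly spaced in [0, 1]
var_edge = posterior_variance(0.0, train_x)    # neighbors on one side only
var_center = posterior_variance(0.5, train_x)  # neighbors on both sides
print(var_edge, var_center)
```

Both query points sit the same distance from their nearest training input, so the difference comes from the edge point having correlated neighbors on only one side. A variance-maximizing acquisition would therefore prefer the edge even though no observed value was ever used.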

2606.07560 2026-06-09 cs.CL cs.LG New submission

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

Han-yu Wang

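Affiliations: The University of Hong Kong

AI summary: Finds that function-vector heads are not a homogeneous group but split into two sub-populations, writers and cancellers, which push the rule-correct logit up and down respectively, and that magnitude-only ranking cannot tell them apart.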

英文摘要

Function-vector (FV) heads (Todd et al., 2024) are typically identified by the magnitude of their causal contribution to in-context rule tasks, under the implicit assumption that the top set is a homogeneous functional class. This assumption fails. We replace magnitude-only ranking with a sign-preserving criterion (refined DLA + permutation FDR) and validate each candidate by path patching. The FV head population then splits into two opposing sub-populations: writers push the rule-correct logit up; cancellers push it down. A four-condition canonical verdict holds in $13/15$ cells across three model families and six Pythia scales, and a sign-shuffle rejects homogeneity in $5/6$ main cells. The structure is invisible to magnitude-only ranking: Todd's top-$20$ captures $64\%$ of cancellers but only $4\%$ of writers on the hierarchical task, and $59\%$ of writers but only $8\%$ of cancellers on the modular task. We rule out six artefact accounts on all $27$ canceller (cell, head) pairs: induction overlap, sinks, generic importance, rank-$1$ copy-suppression, V-cascade, and rank-nearest non-FV controls. Zero-ablating cancellers yields $+0.13$ to $+0.29$ nats of logit gain in $6/6$ main cells with a directionally consistent $+2$ to $+7$ pp accuracy effect.
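
The difference between magnitude-only and sign-preserving selection can be shown with a toy table of head contributions. This is a minimal sketch under stated assumptions: the head names and scores are invented, and the paper's actual criterion (refined DLA with permutation FDR, followed by path patching) is not reproduced here.

```python
# Hypothetical signed contributions of attention heads to the rule-correct logit
# (positive pushes the logit up, negative pushes it down). All values are made up.
contributions = {
    "L3H1": 0.42, "L5H7": -0.38, "L2H4": 0.05,
    "L6H2": -0.31, "L4H0": 0.27, "L1H3": -0.02,
}

def top_k_by_magnitude(scores, k):
    """Magnitude-only ranking: the sign of each contribution is discarded."""
    return sorted(scores, key=lambda h: abs(scores[h]), reverse=True)[:k]

def split_by_sign(scores, heads):
    """Sign-preserving split into writers (positive) and cancellers (negative)."""
    writers = [h for h in heads if scores[h] > 0]
    cancellers = [h for h in heads if scores[h] < 0]
    return writers, cancellers

top = top_k_by_magnitude(contributions, k=4)
writers, cancellers = split_by_sign(contributions, top)
print("top-4 by magnitude:", top)
print("writers:", writers, "cancellers:", cancellers)
```

In this toy case the top-4 set contains two heads of each sign, which a magnitude-only ranking would report as one class.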

2606.07559 2026-06-09 cs.CL cs.AI quant-ph 新提交

Phantom transitions in language model fine-tuning

语言模型微调中的幻影相变

Vaibhav Prakash, Jayasri Dontabhaktuni

Affiliations * Mahindra University

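AI summary Studies why language-model fine-tuning fails when the correct completion competes with a near-synonym. An order parameter separates a signal term from a background drag and exposes two failure modes, and the apparent phase transitions turn out to be phantoms that arise from the softmax readout rather than from a geometric transition.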

Comments 26 pages, 9 figures

详情
AI中文摘要

在上下文中微调语言模型,当正确补全存在近义词竞争者时,常常无声地失败。交叉熵损失单调递减,而正确token在排名上从未超越竞争者。我们研究了跨越两个系列和五倍参数范围的五种Transformer架构,在十个精心挑选的近义词上下文中。我们用一个结合预测分布和成对嵌入重叠的序参量来测量这些失败。它可加性地分解为一个信号(跟踪模型对正确token相对于其最近竞争者的承诺)和一个背景拖拽(由嵌入整体向分数泄漏概率的方式决定)。这分离出两种失败模式:运动学失败中信号保持较小;结构失败中拖拽随着微调进行而主动恶化。我们观察到序参量中类似相变的弹弓状跳跃。一个核心负面结果组织了本文:这些相变是幻影。直接测量排除了自发对称破缺的解释。在LoRA微调下,当token嵌入矩阵在训练期间完全不变时,弹弓状跳跃仍然出现,而此处不可能存在几何相变。不连续性完全存在于softmax读出中。少量无量纲量组织跨架构的轨迹。其中一个在所有五种架构的全微调下保持一致。第二个根据整体嵌入分布将架构分为两类,并预测LoRA的充分性。作为盲测,该框架预测了一个未用于拟合任何参数的保留架构的临界学习率,与后续学习率扫描的误差在2.1%以内。研究结果仅涉及近义词机制,未经重新校准不应外推。

英文摘要

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.
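
The silent failure in the first two sentences of this abstract, a falling cross-entropy with no change in rank, is easy to reproduce with three logits. The checkpoint values below are invented for illustration and do not come from the paper; the third logit stands in for the rest of the vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

CORRECT, COMPETITOR = 0, 1  # token indices; index 2 stands for the remaining vocabulary

# Hypothetical logits at three fine-tuning checkpoints (made-up numbers).
checkpoints = [
    [0.0, 0.5, 1.5],
    [1.0, 1.4, 0.5],
    [2.0, 2.3, -1.0],
]

losses, overtaken = [], []
for logits in checkpoints:
    probs = softmax(logits)
    losses.append(-math.log(probs[CORRECT]))  # cross-entropy on the correct token
    overtaken.append(probs[CORRECT] > probs[COMPETITOR])

print("losses:", [round(l, 3) for l in losses])
print("correct token ranked above competitor:", overtaken)
```

Probability mass moves from the bulk to both the correct token and its competitor, so the loss falls at every checkpoint while the competitor stays ahead.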

2606.07558 2026-06-09 cs.CV cs.AI cs.DL 新提交

Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

基于百年跨度扫描文档档案微调的页面图像分类器,用于进一步的内容特定处理

Kateryna Lutsai, Pavel Straňák, David Novák, Dana Křivánková

Affiliations * Institute of Formal and Applied Linguistics, Charles University MFF; Institute of Archaeology, Czech Academy of Sciences

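AI summary Manual sorting is infeasible when digitizing historical documents, so the paper proposes automatic page-image classification by visual content type (text, tables, graphics). Fine-tuned deep networks reach near-perfect accuracy (RegNetY-16GF: 99.16%), and the models, dataset, and code are released.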

Comments 29 pages, 19 figures, 13 tables. arXiv admin note: text overlap with arXiv:2507.21114

详情
AI中文摘要

目的:人文学科的数字化项目产生了大量、异构的历史文档档案,使得手动分类在大规模下不切实际。本工作解决基于视觉内容类型——文本、表格和图形——对扫描页面图像进行分类的自动化系统需求,从而支持内容特定的下游处理,如光学字符识别(OCR)或结构化数据提取。方法:开发了一个图像分类系统,并在来自百年历史的捷克考古档案的超过48,000张带注释的历史页面图像数据集上进行评估,通过四个连续的注释阶段和领域专家审查进行优化。使用手工制作的图像特征建立了随机森林分类器基线。随后,微调并比较了深度学习架构:卷积神经网络(EfficientNetV2、RegNetY)、视觉和文档图像变换器(ViT、DiT)以及多模态CLIP模型。与领域专家合作设计了11类标签方案,并通过五折交叉验证进行评估。结果:基于特征的基线实现了约75%的准确率。微调的CNN和变换器显著优于基线,RegNetY-16GF在保留测试集上达到99.16%的Top-1准确率,ViT-large达到99.12%。CLIP ViT-B/16通过优化文本描述达到99.14%的准确率。结论:仅图像模型,特别是RegNetY-16GF,实现了近乎完美的分类准确率,并在649,508张未标注档案页面上产生一致标签,模型间一致性超过90%。微调的CLIP尽管在测试集上具有竞争力,但在未标注数据上与仅图像模型的一致性低于65%,因此不太适合部署。最终模型、注释数据集和软件均以开源许可证公开提供。

英文摘要

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
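
The inter-model agreement figure quoted in the conclusion can be read as the share of pages on which two models assign the same label. The sketch below assumes that definition; the label names and predictions are invented and do not reflect the paper's 11-category scheme.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two models predict the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same pages")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical predictions from two models on five unlabeled pages.
model_a = ["TEXT", "TABLE", "TEXT", "PHOTO", "TEXT"]
model_b = ["TEXT", "TABLE", "PHOTO", "PHOTO", "TEXT"]

rate = agreement_rate(model_a, model_b)
print(rate)  # 0.8
```

Agreement on unlabeled data needs no ground truth, which is why it can flag a model such as the fine-tuned CLIP variant even when test-set accuracy looks competitive.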

2606.07557 2026-06-09 cs.LG cs.MA cs.SI 新提交

SPIN: Decentralized Swarm Control via Tensorized Policy Coordination

SPIN: 通过张量化策略协调实现去中心化集群控制

Zhaowen Fan

Affiliations * Zhaowen Fan

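AI summary Proposes the SPIN framework, which factorizes the joint policy with a tensor network to cut evaluation cost from exponential to linear, and uses an offline-trained neuro-symbolic pipeline for low-latency decentralized swarm control on edge devices.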

Comments 11 pages, 2 figures, 1 tables, 6 sections

详情
AI中文摘要

在资源受限的边缘平台上,去中心化多智能体集群协调仍然受到联合动作空间指数级扩展和高延迟通信开销的根本性瓶颈。本文介绍了集群策略干扰网络(SPIN)框架,这是一种通过将集群拓扑建模为压缩张量网络来绕过这些限制的架构范式。我们将局部多智能体团簇的联合策略张量分解为矩阵乘积态(MPS)链,将评估的计算复杂度从指数级 $O(n^m)$ 墙降低到严格的线性 $O(m \cdot n \cdot \chi^2)$ 约束。为了在不需高功耗在线训练循环的情况下,将局部连续空间几何与该离散代数后端桥接,我们引入了一个解耦的混合神经符号控制管道。局部多层神经网络作为结构协调编码器,离线预训练以将手工设计的几何描述符非线性映射为抽象环境目标度量。在运行时,边缘智能体通过直接应用 Radon-Nikodým 导数作为零样本重要性重加权滤波器来执行即时行为适应。我们在一个离散时间多智能体仿真沙箱中验证了该框架,涵盖跟踪、去中心化分散/区域覆盖和多目标协调等场景。定性遥测表明,集成管道实现了稳定的目标导向运动、去中心化约束下的抗塌陷空间扩展以及跨多个目标的结构化子群形成,为可处理、低功耗的边缘集群智能提供了一条数学上严谨的路径。

英文摘要

Decentralized multi-agent swarm coordination on resource-constrained edge platforms remains fundamentally bottlenecked by the exponential scaling of joint action spaces and high-latency communication overhead. This paper introduces the Swarm Policy Interference Network (SPIN) framework, an architectural paradigm that bypasses these limitations by modeling swarm topologies as a compressed tensor network. We factorize the joint policy tensors of local multi-agent cliques into Matrix Product State (MPS) chains, reducing the computational complexity of evaluation from an exponential $O(n^m)$ wall to a strictly linear $O(m \cdot n \cdot \chi^2)$ constraint. To bridge local continuous spatial geometry with this discrete algebraic backend without requiring power-intensive online training loops, we introduce a decoupled, hybrid neuro-symbolic control pipeline. Local multi-layered neural networks operate as structural coordination encoders, pre-trained offline to nonlinearly map hand-engineered geometric descriptors into abstract environmental target measures. At runtime, edge agents execute instantaneous behavioral adaptations by applying the Radon-Nikodým derivative directly as a zero-shot importance-reweighting filter. We validate the framework within a discrete-time multi-agent simulation sandbox spanning tracking, decentralized dispersion/area coverage, and multi-goal coordination regimes. Qualitative telemetry demonstrates that the integrated pipeline achieves stable target-directed motion, anti-collapse spatial spreading under decentralized constraints, and structured subgroup formation across multiple targets, providing a mathematically grounded route to tractable, low-power edge swarm intelligence.
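
The complexity claim can be made concrete with a generic Matrix Product State. The sketch below is not the SPIN implementation: the random cores, the values of m, n, and the bond dimension chi, and the choice to treat entries as unnormalized scores are all assumptions. It shows that summing over all n^m joint actions needs only one pass over the chain, and checks the result against the dense tensor for a tiny case.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, chi = 4, 3, 2  # agents, actions per agent, bond dimension (illustrative values)

# One core per agent with shape (left bond, action, right bond); boundary bonds have size 1.
bonds = [1] + [chi] * (m - 1) + [1]
cores = [rng.random((bonds[i], n, bonds[i + 1])) for i in range(m)]

def mps_value(cores, actions):
    """Unnormalized score of one joint action: a chain of small matrix products."""
    vec = np.ones((1,))
    for core, a in zip(cores, actions):
        vec = vec @ core[:, a, :]
    return vec.item()

def mps_total(cores):
    """Sum over all n**m joint actions in O(m * n * chi**2) by summing each action axis first."""
    vec = np.ones((1,))
    for core in cores:
        vec = vec @ core.sum(axis=1)
    return vec.item()

# Brute-force check against the dense joint tensor, feasible only for tiny m.
dense = cores[0]
for core in cores[1:]:
    dense = np.tensordot(dense, core, axes=1)
dense = dense.reshape((n,) * m)

print(np.isclose(mps_value(cores, (0, 1, 2, 0)), dense[0, 1, 2, 0]))
print(np.isclose(mps_total(cores), dense.sum()))
```

The dense tensor holds n^m entries, while the chain stores roughly m * n * chi^2 numbers, which is the trade the abstract describes.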

2606.07553 2026-06-09 cs.LG cs.AI 新提交

MedicalRec: Medical recommender system for image classification without retraining

MedicalRec:无需重新训练的图像分类医疗推荐系统

Roghayeh Taghavi, Aysa Hasanazde Bashkandi, Amir Ali Bengari, Mohammad Amin Raji, Mohammad Salahi Ardekani, Parisa Mardukhian, Parvaneh Rezaei, Ramin Mousa

Affiliations * University of Tehran

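AI summary Proposes MedicalRec, a transformer-based recommender that suggests models for medical image classification tasks without retraining, using the MedicalRec-Bench dataset (over 5,000 records collected from 3,000 articles); the best HitRate@100 is 75.5%.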

详情
AI中文摘要

机器学习和深度学习的出现彻底改变了医疗保健中诊断、治疗和管理系统的效率。然而,这种快速采用是以需要大量计算能力和能源消耗以及电子垃圾处理和碳排放为代价的。这些模型的挑战之一是为分类任务选择合适的模型。为此,研究人员尝试通过试错法使用他们的数据来确定最佳模型,这涉及能源消耗和浪费。本研究的目标是开发一个基于模型的医疗图像分类推荐系统。为此,从3000篇医疗图像分类领域的文章中收集了一个数据集。该数据集以MedicalRec-Bench的名称公开可用,包含超过5000条在各种任务中测试的模型记录,包括皮肤癌分类、肿瘤分类、伤口分类、乳腺癌和MRI分类。根据特征数量,数据集在四种不同模式下进行评估:MedicalRec I(5个特征)、MedicalRec II(9个特征)、MedicalRec III(11个特征)和MedicalRec IV(18个特征)。由于作者未报告,收集所有特征值具有挑战性;因此,数据集包含大量缺失值。医疗推荐系统(MedicalRec)是一个基于Transformer的模型,用于本研究中的项目推荐。该模型在数据集评估和与12个基础模型的评估中取得了显著成果。该模型实现了最高HitRate@100为75.5%。数据集和实现可通过GitHub链接获取:https://github.com/Ramin1Mousa/MedicalRec

英文摘要

The emergence of machine learning and deep learning has revolutionized the efficiency of diagnostic, therapeutic, and administrative systems in healthcare. However, this rapid adoption has come at the cost of requiring significant computing power and energy consumption, as well as e-waste disposal and carbon emissions. One of the challenges of these models is choosing the right model for classification tasks. To this end, researchers attempt to identify the optimal model using their data through trial and error, which involves energy consumption and waste. The goal of this study is to develop a model-based recommender system for medical image classification. For this purpose, a data set was collected from 3,000 articles in the field of medical image classification. This dataset, publicly available under the name MedicalRec-Bench, contains over 5,000 records of models tested in various tasks, including Skin Cancer Classification, Tumour Classification, Wound Classification, Breast Cancer, and MRI classification. The dataset was evaluated in four different modes, depending on the number of features: MedicalRec I (5 features), MedicalRec II (9 features), MedicalRec III (11 features), and MedicalRec IV (18 features). Collecting all values for the features is challenging due to non-reporting by the authors; hence, the dataset contains significant amounts of missing values. The Medical Recommender System (MedicalRec) is a transformer-based model used for item recommendations in this study. This model achieved remarkable results in the evaluation on the dataset and in the evaluation with 12 base models. This model achieved a maximum HitRate@100 of 75.5%. The dataset and implementations are available through the GitHub link: https://github.com/Ramin1Mousa/MedicalRec
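
HitRate@k, the metric reported at the end of this abstract, is commonly defined as the share of queries whose target item appears among the top k recommendations. The sketch below assumes that standard definition; the ranked lists and targets are invented and are not taken from MedicalRec-Bench.

```python
def hit_rate_at_k(ranked_lists, targets, k):
    """Share of queries whose target item appears in the top-k recommendations."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)

# Hypothetical recommendations (best first) for four classification tasks.
ranked_lists = [
    ["ResNet50", "VGG16", "DenseNet121"],
    ["VGG16", "MobileNetV2", "ResNet50"],
    ["DenseNet121", "ResNet50", "VGG16"],
    ["MobileNetV2", "VGG16", "DenseNet121"],
]
# Hypothetical model that actually performed best on each task.
targets = ["VGG16", "ResNet50", "DenseNet121", "ResNet50"]

print(hit_rate_at_k(ranked_lists, targets, k=2))  # 0.5
print(hit_rate_at_k(ranked_lists, targets, k=3))  # 0.75
```

The metric can only grow with k, so a HitRate@100 figure is best read together with the number of candidate models being ranked.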

2606.07550 2026-06-09 cs.LG cs.AI 新提交

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

核聚变等离子体控制的离线强化学习:代码库与基准

Yang Fu, Haomin Bao, Rohit Sonker, Xiaoyan Hu, Aravind Venugopal, Jeff Schneider, Jiayu Chen

Affiliations * Central South University; Chongqing University; Carnegie Mellon University; The University of Hong Kong

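AI summary Introduces the RL4F benchmark, whose evaluation environments are built from historical DIII-D tokamak data, and compares offline RL methods on plasma-control tasks; model-based offline RL obtains the best average performance on most objectives.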

Comments 23 pages (10 pages main text)

详情
AI中文摘要

离线强化学习(RL)为从历史托卡马克数据开发等离子体控制器提供了一条有前景的途径,因为在真实设备上进行在线试错成本高昂且风险巨大。然而,由于缺乏针对核聚变中现实多执行器、长时域等离子体控制问题的标准化离线RL基准,这一方向的进展仍然难以衡量。我们引入了RL4F,一个用于核聚变等离子体控制的离线强化学习基准,提供了闭环评估环境和四个全剖面跟踪任务(旋转、密度、温度和压力)的基线比较。评估环境背后的动力学函数基于真实托卡马克DIII-D的历史放电数据构建。我们在统一协议下评估了广泛的模仿学习和离线RL基线。我们发现,基于模型的离线RL方法在大多数目标上获得了最佳平均性能,尽管没有单一方法在所有任务中占主导地位,这突显了动力学建模在复杂、长时域等离子体控制任务中的重要性。为了促进进一步研究,我们开源了代码库、数据集和评估框架,不仅为聚变社区,也为离线RL的算法开发提供了一个基准。

英文摘要

Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction remains difficult to measure due to the lack of a standardized offline RL benchmark for realistic multi-actuator, long-horizon plasma control problems in nuclear fusion. We introduce RL4F, an Offline Reinforcement Learning Benchmark for Plasma Control in Nuclear Fusion, providing closed-loop evaluation environments and baseline comparisons across four full-profile tracking tasks: rotation, density, temperature, and pressure. The dynamics function underlying the evaluation environment is built from historical discharge data from DIII-D, a real-world Tokamak. We evaluate a broad set of imitation learning and offline RL baselines under a unified protocol. We find that offline model-based RL methods obtain the best average performance on most objectives, although no single method dominates all tasks, highlighting the importance of dynamics modeling in complex, long-horizon plasma control tasks. To foster further research, we open-source the codebase, datasets, and evaluation framework, providing a benchmark not only for the fusion community but also for algorithm development in offline RL.

2606.07549 2026-06-09 cs.AI cs.MA 新提交

PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

PathoSage:通过经验感知的代理工作流实现病理学多源证据裁决

Chengyang Zhang, Wenchuan Zhang, Bo Li, Mengran Li, Bob Zhang, Yuhao Yi, Hong Bu, Jiancheng Lv

Affiliations * College of Computer Science, Sichuan University; Department of Pathology and Institute of Clinical Pathology, West China Hospital, Sichuan University; Department of Computer and Information Science, University of Macau; School of Intelligent Systems Engineering, Sun Yat-sen University

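AI summary Proposes the PathoSage framework, which uses Structured Evidence Deliberation and a Beta-Bernoulli experience system to evaluate tool evidence independently and resolve conflicts, reducing hallucinations and classifier disagreement and making pathology reasoning more robust.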

详情
AI中文摘要

多模态大语言模型(MLLMs)和代理工作流的最新进展在计算病理学中显示出巨大潜力,但可靠的补丁级推理仍然具有挑战性。端到端的病理学MLLM常常幻觉形态特征,而最近的代理系统通常将工具输出和检索知识合并到共享上下文中,使得决策容易受到冲突证据和上下文污染的影响。我们提出PathoSage,一个三阶段框架,明确分离知识检索、证据收集和证据裁决,用于补丁级病理学多模态推理。其核心组件结构化证据审议独立评估来自工具的异质证据,执行冲突分析,并在全新上下文中生成最终判断,以减少锚定偏差。我们进一步引入一个无需训练的Beta-Bernoulli经验系统,具有连续信用分配,以建模长期工具可靠性,并为未来工具使用构建相似性加权先验。实验表明,PathoSage有效缓解了VQA幻觉和分类器分歧,优于强病理学MLLM和代理基线。我们的结果强调了明确的证据裁决和可靠性感知工具建模是构建鲁棒病理学代理的关键要素。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination. We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.
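
A Beta-Bernoulli reliability tracker with continuous credit can be sketched in a few lines. This is one plausible reading of the abstract, not the PathoSage code: the class name, the uniform Beta(1, 1) prior, and the fractional update rule are assumptions, and the similarity-weighted priors mentioned in the abstract are not covered.

```python
from dataclasses import dataclass

@dataclass
class ToolReliability:
    """Hypothetical per-tool reliability record kept as Beta pseudo-counts."""
    alpha: float = 1.0  # pseudo-count of successes (uniform Beta(1, 1) prior)
    beta: float = 1.0   # pseudo-count of failures

    def update(self, credit: float) -> None:
        """Fractional Bernoulli update: credit lies in [0, 1] instead of being a hard 0/1 outcome."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        """Posterior mean reliability."""
        return self.alpha / (self.alpha + self.beta)

tool = ToolReliability()
for credit in [1.0, 0.5, 0.0, 1.0]:  # made-up credit assignments from four past cases
    tool.update(credit)
print(round(tool.mean, 4))
```

The update needs no gradient step, which is consistent with the abstract's description of the experience system as training-free.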

2606.07547 2026-06-09 cs.CL cs.AI cs.SD 新提交

Liberating LLM Capabilities in Full-Duplex Speech Models

在全双工语音模型中释放LLM能力

Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

发表机构 * Royal Zhang(皇家张)

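AI summary Proposes the Listen-Write-Speak (LWS) tri-channel paradigm, in which an LLM listens, writes visible text, and speaks a real-time response at the same time under a shared causal attention context, giving full-duplex interaction with no architectural changes.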

详情
AI中文摘要

基于语音的大型语言模型通常局限于口语回复,这将其面向用户的输出限制在可口头表达的内容上,并抑制了文本原生能力,如代码生成、结构化分析和实时交互中的多步推理,对于需要持久、结构化且可检查的中间输出的任务。现有工作改进了口语推理或全双工轮流发言,但仍将文本视为隐藏的中间状态或从属模态,而非第一类输出通道。我们提出Listen-Write-Speak (LWS),一种文本优先的三通道范式,其中单个自回归LLM持续监听用户音频,写出可见的自由形式文本作为其主要输出,并在共享因果注意力上下文中并行生成实时口语回应。该行为完全通过Token Schema实现,无需架构修改,并通过两阶段数据流水线学习,该流水线合成与揭示的输入时间线一致的每秒认知注释。实验上,LWS在Full-Duplex-Bench上展示了强大的全双工交互,在VoiceBench AlpacaEval上达到4.72,写作-口语一致性达92.6%,并在URO-Bench上持续优于其内部消融版本。这些结果表明,可见书写可以作为语音交互的第一类输出通道,而不会牺牲实时响应性。代码和数据集可在项目页面获取:https://royalzhang.com/project/lws-page/。

英文摘要

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

2606.07540 2026-06-09 cs.CL 新提交

Finding Hidden Relationships Between Medical Concepts by Leveraging Metamap and Text Mining Techniques

利用MetaMap和文本挖掘技术发现医学概念间的隐藏关系

Weikang Yang, S M Mazharul Hoque Chowdhury, Wei Jin

Affiliations * Department of Computer Science, North Dakota State University; Department of Computer Science and Engineering, Daffodil International University; Department of Computer Science and Engineering, University of North Texas

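AI summary Proposes a new model that combines MetaMap with text mining and builds a comprehensive index structure to find hidden cross-document links between medical concepts; experiments support its effectiveness.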

Journal ref Advanced Data Mining and Applications (ADMA) 2022

详情
AI中文摘要

文本是当今计算机化世界中最常见的数据存储方式之一。乍一看,这些数据似乎互不关联。但实际上,数据可能存在隐藏的联系。因此,本研究提出了一种新模型,该模型通过使用MetaMap和适当的文本挖掘技术,能够发现两个医学概念之间的隐藏关系。具体来说,该模型创建了一种新的综合索引结构,能够发现连接感兴趣主题的跨文档隐藏链接,而大多数现有方法忽略了这些链接。实验表明,所提出的模型在发现主题间新联系方面具有有效性。

英文摘要

Text is one of the most common ways to store data in this computerized world. At a glance, it may seem that those data are not interconnected. But in reality, data can have hidden connections. Therefore, in this research, a new model has been presented that can find hidden relationships between two medical concepts by using MetaMap and appropriate text-mining techniques. Specifically, the model creates a new comprehensive index structure and can find cross-document hidden links connecting topics of interest that most existing approaches have ignored. Experiments show the effectiveness of the proposed model in discovering new connections between topics.

2606.07535 2026-06-09 cs.CL 新提交

Multilingual Refusal Alignment for Safer Large Language Models

多语言拒绝对齐:构建更安全的大型语言模型

Aleksandra Krasnodębska, Wojciech Kusa, Aldo Lipani

Affiliations * NASK National Research Institute; University College London

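AI summary Systematically studies multilingual alignment dynamics using RefusEU, a dataset covering 12 European languages, and DPO experiments. English-only alignment does not ensure cross-lingual safety, whereas multilingual training improves safety without degrading general performance.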

Comments Accepted to Findings ACL 2026

详情
AI中文摘要

随着大型语言模型(LLMs)在全球范围部署,确保其在多种语言中的安全性和对齐变得至关重要。然而,安全行为在不同语言之间往往表现出不可预测的差异,这对一致且合乎道德的人工智能构成了重大挑战。在这项工作中,我们系统研究了多语言对齐的动态,探讨了单语言对齐是否能够跨语言迁移、训练过程中如何保持语言一致性,以及由此产生的与通用知识能力之间的权衡。我们引入了RefusEU,一个覆盖12种欧洲语言的新型拒绝对齐数据集,其中包括一个用于评估当前最先进模型的专用测试集。我们的受控直接偏好优化(DPO)实验提供了两个关键见解:仅在英语中对齐模型不足以确保跨语言安全性,即使对于相同的危害类别也是如此;而使用多语言数据集进行训练可以在不降低通用性能(通过Global MMLU基准衡量)的情况下提高安全性。

英文摘要

As Large Language Models (LLMs) are deployed globally, ensuring their safety and alignment across multiple languages becomes paramount. However, safety behaviors often vary unpredictably between languages, posing significant challenges for consistent and ethical AI. In this work, we systematically investigate the dynamics of multilingual alignment, exploring whether single-language alignment transfers cross-lingually, how language consistency is preserved during training, and the resulting trade-offs with general knowledge capabilities. We introduce RefusEU, a novel refusal alignment dataset covering 12 European languages, including a dedicated test set for evaluating current state-of-the-art models. Our controlled Direct Preference Optimization (DPO) experiments provide two key insights: aligning models exclusively in English is insufficient to ensure cross-lingual safety, even for the same harm categories, whereas training on multilingual datasets can improve safety without degrading general performance, as measured by the Global MMLU benchmark.
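
For readers unfamiliar with Direct Preference Optimization, the per-pair loss is the negative log-sigmoid of a scaled difference of log-probability ratios between the policy and a frozen reference model. The sketch below implements that standard formula for a single preference pair; the log-probabilities and the value of beta are illustrative, and nothing here reflects the authors' training setup. In a refusal-alignment setting the chosen response would typically be the refusal.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raises the chosen response and lowers the rejected one relative to the reference.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
print(round(baseline, 4), round(improved, 4))
```

The loss only rewards movement relative to the reference model, so a pair the reference already ranks correctly still contributes log(2) until the policy widens the gap.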

2606.07533 2026-06-09 cs.CL cs.AI cs.SD 新提交

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

桥接传统可解释性方法与多模态多语言模型:基于XAI的分析

Paweł Pozorski, Jakub Muszyński, Maria Ganzha

Affiliations * arXiv

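AI summary Proposes a multimodal Shapley value framework with Spectrogram-Guided Phonetic Alignment (SGPA) preprocessing for interpretable attribution over text and audio features, and releases an open-source computation package and visualization tool.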

Comments Bachelor's thesis

详情
AI中文摘要

多模态大语言模型(MLLMs)有效整合文本和音频以理解复杂交互对话中的上下文。然而,异质模态影响模型行为的内部机制仍然不透明。虽然Shapley值(SV)为基于文本的NLP提供了鲁棒的、模型无关的局部可解释性框架,但其扩展到多模态数据受到跨通道依赖、复杂对话结构以及密集音频表示的高计算复杂性的阻碍。在这项工作中,我们形式化了Shapley值框架的多模态扩展,将离散文本标记和对齐的音频片段视为协作特征。为确保计算可行性,我们部署了一套高效的估计策略:低维输入的精确SV计算和基于采样的近似——包括蒙特卡洛排列和具有Neyman最优分配的分层抽样——以在有限计算预算下最小化方差。为解决模态间的粒度不匹配问题,我们提出了频谱图引导的音素对齐(SGPA),一种新颖的预处理方法,将高频音频流映射到可解释的、单词对齐的片段。我们的贡献有两方面:首先,我们提供了一个开源的、模型无关的Python包和配套的GUI,用于多模态归因的计算和交互式可视化。其次,我们使用VoiceBench和Infinity Instruct数据集的精选子集,在多种多语言场景下评估我们的框架。实验结果表明,输入模态是归因波动的主要驱动因素,并证明标准句法重要性代理在多模态跨语言上下文中通常无法预测模型注意力。

英文摘要

Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. However, the internal mechanisms by which heterogeneous modalities influence model behavior remain opaque. While Shapley Values (SV) provide a robust, model-agnostic framework for local explainability in text-based NLP, their extension to multimodal data is hindered by cross-channel dependencies, intricate dialogue structures, and the prohibitive computational complexity of dense audio representations. In this work, we formalize a multimodal extension of the Shapley Value framework, treating discrete text tokens and aligned audio segments as cooperative features. To ensure computational feasibility, we deploy a suite of efficient estimation strategies: exact SV computation for low-dimensional inputs and sampling-based approximations - including Monte Carlo permutations and stratified sampling with Neyman-optimal allocation - to minimize variance under constrained computational budgets. To resolve the granularity mismatch between modalities, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a novel preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments. Our contribution is twofold: first, we provide an open-source, model-agnostic Python package and a companion GUI for the computation and interactive visualization of multimodal attributions. Second, we evaluate our framework using curated subsets of the VoiceBench and Infinity Instruct datasets across diverse multilingual scenarios. Our experimental results reveal that input modality is a primary driver of attribution volatility and demonstrate that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.
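
Monte Carlo permutation sampling, one of the estimation strategies named in this abstract, averages each feature's marginal contribution over random orderings. The sketch below is a generic estimator, not the authors' package: the player names and the toy value function (additive weights plus one text-audio interaction) are invented, and stratified sampling with Neyman allocation is not shown.

```python
import random

def shapley_monte_carlo(players, value_fn, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(coalition)
        for p in order:
            coalition.add(p)
            cur = value_fn(coalition)
            totals[p] += cur - prev
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}

# Hypothetical features: two text tokens and two aligned audio segments.
WEIGHTS = {"text:refund": 0.5, "text:please": 0.0, "audio:seg0": 0.3, "audio:seg1": 0.1}

def toy_value(coalition):
    """Made-up model score: additive weights plus a cross-modal interaction term."""
    score = sum(WEIGHTS[p] for p in coalition)
    if "text:refund" in coalition and "audio:seg0" in coalition:
        score += 0.4
    return score

players = list(WEIGHTS)
estimates = shapley_monte_carlo(players, toy_value)
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

For this toy game the exact values are known in closed form: the interaction of 0.4 is shared equally, so `text:refund` should approach 0.7 and `audio:seg0` 0.5, which gives a quick sanity check on the estimator.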

2606.07531 2026-06-09 cs.CL cs.AI 新提交

mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models

mllm-shap:面向文本-音频多模态大语言模型的Shapley值可解释性平台

Jakub Muszyński, Paweł Pozorski, Maria Ganzha

Affiliations * Warsaw University of Technology

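AI summary Presents mllm-shap, a framework that extends Shapley value explainability to text-audio multimodal LLMs through modality-aware masking, multi-turn conversation tracking, and phonetic alignment-based token grouping, which shrinks the coalition space by 10x to 50x.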

Comments Submitted to ACL2026

详情
AI中文摘要

我们介绍了mllm-shap,一个开源Python框架,旨在将Shapley值(SV)可解释性从纯文本大语言模型扩展到处理联合文本和音频输入的多模态大语言模型(MLLM)。虽然基于文本的归因已得到充分研究,但mllm-shap解决了多模态领域特有的三个关键挑战:(1)模态感知的联盟掩码,管理离散文本令牌和密集音频编码器帧的交错处理。(2)多轮对话追踪,利用每令牌元数据维护角色和模态上下文。(3)基于音素对齐的令牌分组,一种新颖的技术,将联盟空间减少10到50倍,使得长音频的SV估计在计算上可行。该平台实现了五种SV估计策略,包括具有Neyman最优分配的互补贡献(CC)估计器,其收敛性优于标准蒙特卡洛基线。mllm-shap作为pip可安装包提供,并具有交互式基于Web的GUI,用于细粒度归因可视化。据我们所知,这是第一个公开可用的框架,为文本-音频MLLM中的基于SV的可解释性提供完整、可复现的流水线。

英文摘要

We introduce mllm-shap, an open-source Python framework designed to extend Shapley Value (SV) explainability from text-only Large Language Models to Multimodal LLMs (MLLMs) processing joint text and audio inputs. While text-based attribution is well-studied, mllm-shap addresses three critical challenges unique to the multimodal regime: (1) Modality-aware coalition masking, which manages the interleaved processing of discrete text tokens and dense audio encoder frames. (2) Multi-turn conversation tracking, utilizing per-token metadata to maintain role and modality context. (3) Phonetic alignment-based token grouping, a novel technique that reduces the coalition space by 10x to 50x, rendering SV estimation computationally feasible for long-form audio. The platform implements five SV estimation strategies, including a Complementary Contributions (CC) estimator with Neyman-optimal allocation that demonstrates superior convergence over standard Monte Carlo baselines. mllm-shap is provided as a pip-installable package featuring an interactive web-based GUI for granular attribution visualization. To our knowledge, this is the first publicly available framework providing a complete, reproducible pipeline for SV-based explainability in text-audio MLLMs.

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM 新提交

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

CAPruner: 概念相邻场景图剪枝器以增强大语言模型的3D空间推理

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

Affiliations * Southern University of Science and Technology; SpatialTemporal AI

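AI summary Proposes the Conceptual-Adjacent Scene Graph Pruner (CAPruner), which estimates relation importance by combining fuzzy semantic relevance with spatial proximity, selects critical relations in a task-specific context, avoids relation-level annotation, and substantially improves LLM spatial reasoning on 3D vision-language tasks.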

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)最近被应用于3D视觉语言(3D-VL)任务,这些任务需要空间推理以识别相对于锚点的目标物体。场景图通常用于表示此类关系,但在完整图上进行推理会导致高昂的令牌成本和计算效率低下,因此需要剪枝。现有的剪枝方法主要依赖空间邻近性,常常移除任务相关的关系,从而削弱可靠的空间推理。为了解决这些局限性,我们推导出场景图剪枝的一个关键要求:保留与特定3D-VL任务最相关的空间关系。在此洞察指导下,我们提出了概念相邻场景图剪枝器(CAPruner)。CAPruner将模糊语义相关性与空间邻近性相结合,以估计关系的重要性,从而能够在任务特定上下文中选择关键关系。此外,为了避免昂贵的关系级标注,CAPruner通过监督每个节点入射边的聚合分数进行训练。大量实验表明,CAPruner有效保留了空间推理所必需的关系,从而显著提升了LLMs在3D-VL任务上的性能。代码可在 https://github.com/fz-zsl/CAPruner 获取。

英文摘要

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.

2606.07528 2026-06-09 cs.CL cs.AI cs.LG 新提交

BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models

BEACON: 面向大语言模型跨模型幻觉检测的行为熵聚合

Naveen Bera, Pulijala Sai Nikhila, Kondaguduru Abhiram, Shaik Gayaz Ali, Shoaib Sadiq Salehmohamed, Shaik Mohammed Omar, Jinal Prashant Thakkar, Hansika Aredla, Shalmali Ayachit

Affiliations * LLM Lens

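AI summary Proposes BEACON, a black-box detector built on multi-dimensional behavioral features (semantic entropy, embedding geometry, chain-of-thought consistency, paraphrase stability) that reaches 0.8123 AUROC across seven benchmarks and outperforms the semantic-entropy and SelfCheckGPT-style baselines.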

Comments 12 pages, 6 tables, 1 figure. Code and data available upon request

详情
AI中文摘要

大语言模型中的幻觉,即生成事实上不正确或未经支持的内容,仍然是可靠部署的关键障碍。我们提出了BEACON(面向跨模型幻觉检测的行为熵聚合),一个黑盒幻觉检测框架,仅基于模型输出运行,无需访问内部表示或外部知识库。BEACON从结构化的多遍生成中提取31维特征向量,整合了基于NLI的语义熵、嵌入几何、思维链一致性和释义稳定性信号。在七个基准的7,617个标记样本上训练的梯度提升分类器达到了0.8123 ± 0.0102的AUROC(95%置信区间:0.7632-0.8251),优于独立的语义熵(+0.2298)和SelfCheckGPT风格的一致性基线(+0.2457)。特征重要性分析表明,幻觉本质上是多维的,需要组合的不确定性信号。一个高效的5次调用变体达到了0.7795的AUROC,使得在黑盒LLM API上的实际部署成为可能。

英文摘要

Hallucination in large language models (LLMs), defined as the generation of factually incorrect or unsupported content, remains a critical barrier to reliable deployment. We present BEACON (Behavioral Entropy Aggregation for Cross-model hallucination detectiON), a black-box hallucination detection framework that operates purely on model outputs without requiring access to internal representations or external knowledge bases. BEACON extracts a 31-dimensional feature vector from structured multi-pass generation, integrating NLI-based semantic entropy, embedding geometry, chain-of-thought consistency, and paraphrase stability signals. A gradient-boosted classifier trained on 7,617 labeled examples across seven benchmarks achieves 0.8123 +/- 0.0102 AUROC (95% CI: 0.7632-0.8251), outperforming standalone semantic entropy (+0.2298) and SelfCheckGPT-style consistency baselines (+0.2457). Feature importance analysis shows that hallucination is inherently multi-dimensional, requiring combined uncertainty signals. An efficient 5-call variant achieves 0.7795 AUROC, enabling practical deployment across black-box LLM APIs.
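
Semantic entropy, one of the signals BEACON combines, is usually computed by grouping sampled answers into meaning-equivalent clusters and taking the entropy of the cluster frequencies. The sketch below assumes the cluster assignments are already available (for example from an NLI-based equivalence check, which is not implemented here); the sample data are invented, and the other features in the 31-dimensional vector are not reproduced.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the distribution of sampled answers over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical cluster assignments for five sampled answers to the same prompt.
consistent = [0, 0, 0, 0, 0]  # every sample means the same thing
scattered = [0, 1, 2, 3, 4]   # every sample means something different

print(semantic_entropy(consistent))
print(round(semantic_entropy(scattered), 4))
```

Low entropy indicates that the model keeps giving the same answer, which is not the same as the answer being correct; that limitation is one motivation for combining several signals, as the abstract argues.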

2606.07527 2026-06-09 cs.CL cs.AI cs.LG New submission

Post-training is (Massive) Supervised Learning

Michael Hassid, Yossi Adi, Roy Schwartz

Affiliations * FAIR, Meta AI; The Hebrew University of Jerusalem

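AI summary Argues that the current LLM post-training phase (SFT + RL) is in effect a return to the BERT-era "pre-train then fine-tune" paradigm, shows experimentally that models post-trained from scratch can also reach substantial performance, and proposes shifting toward training in which models "learn how to learn".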

Details
Abstract

The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".

2606.07526 2026-06-09 cs.CL cs.AI New submission

GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation

Lin Mu, Guoji Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang

Affiliations * Anhui University; Hefei University; University of Science and Technology of China

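AI summary Proposes the GraphLoRA framework, which embeds a trainable graph message-passing network in the low-rank adaptation pathway so that structural signals can propagate, deeply integrating graph structure with textual semantics and improving LLM recommendation performance.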

Comments ACL 2026 findings

Details
Abstract

Large Language Models (LLMs) have shown strong potential for recommendation (LLMRec) due to their powerful reasoning and generalization abilities. However, effectively aligning the textual semantics modeled by LLMs with the collaborative signals remains a key challenge. Existing methods either translate collaborative information into textual prompts or inject pre-trained embeddings into the LLM, both of which treat structural information as static input and fail to capture high-order relational dependencies. To bridge this gap, we propose GraphLoRA, a novel framework that generalizes low-rank adaptation from independent to structure-aware propagation. GraphLoRA embeds a trainable graph message-passing network within the low-rank adaptation pathway, enabling structural signals to propagate through the parameter space. This design allows collaborative topology to explicitly guide parameter updates, fostering deep integration between graph-structured and textual semantic information. Extensive experiments on multiple benchmarks demonstrate that GraphLoRA not only outperforms state-of-the-art LLM-based recommendation methods but also achieves superior generalization, effectively balancing structural reasoning capability with computational efficiency. Code is available at https://github.com/wgj15965/GraphLoRA.
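To make the contrast between independent and structure-aware low-rank adaptation more concrete, here is a minimal NumPy sketch. It is an assumption-laden illustration, not the authors' architecture: the abstract does not specify how message passing is wired into the low-rank pathway, so the sketch simply mixes the rank-r codes of neighboring nodes with one step of row-normalized adjacency averaging. All function names (`normalize_adjacency`, `lora_delta`, `graph_lora_delta`) and shapes are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj):
    """Add self-loops, then row-normalize so each row sums to 1."""
    adj = adj + np.eye(adj.shape[0])
    return adj / adj.sum(axis=1, keepdims=True)


def lora_delta(x, A, B):
    """Plain low-rank update: each row of x is adapted independently.

    x: (n, d_in), A: (r, d_in), B: (d_out, r) -> (n, d_out)
    """
    return x @ A.T @ B.T


def graph_lora_delta(x, A, B, adj_norm):
    """Hypothetical structure-aware variant (assumption, not the paper's design):
    project down, average rank-r codes over graph neighbors, project up."""
    z = x @ A.T        # (n, r) low-rank codes, one per node
    z = adj_norm @ z   # one message-passing step over the graph
    return z @ B.T     # (n, d_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_in, d_out, r = 4, 8, 8, 2
    x = rng.normal(size=(n, d_in))
    A = rng.normal(size=(r, d_in))
    B = rng.normal(size=(d_out, r))
    # A small path graph 0-1-2-3 standing in for a user-item interaction graph.
    adj = np.array(
        [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float
    )
    delta = graph_lora_delta(x, A, B, normalize_adjacency(adj))
    print(delta.shape)
```

In this sketch, a graph with no edges reduces the normalized adjacency to the identity, so the structure-aware update collapses to the plain low-rank update. That is one way to read "generalizes low-rank adaptation from independent to structure-aware propagation", though the published method may differ.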