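arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.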

2606.05179 2026-06-05 cs.CL

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

Sungmook Woo, Hyungu Kang, Chanwoo Kim

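Affiliations: Korea Advanced Institute of Science and Technology

AI summary: Proposes a non-autoregressive weighted lookahead scoring method that compares punctuation-insertion hypotheses against a no-insertion baseline under limited future context, achieving efficient punctuation restoration for streaming ASR with high F1 scores even without fine-tuning.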

Comments: Accepted for presentation at The International Joint Conference on Neural Networks (IJCNN) 2026

Abstract

Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight α and a validation-calibrated threshold τ (no parameter updates during inference). On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.
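
The per-boundary decision described in this abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formula, so `mark_logp` and `cont_logp` are hypothetical stand-ins for language-model log-probabilities, the label set in `MARKS` is assumed, and the way α and τ enter the comparison is one plausible guess.

```python
# Hypothetical sketch: insert a mark only if it beats the no-insertion baseline.
# Assumed label set: three marks plus "no punctuation" gives four classes.
MARKS = (",", ".", "?")

def decide_boundary(left, lookahead, mark_logp, cont_logp, alpha, tau, k=2):
    """Return the mark to insert after `left`, or "" to insert nothing."""
    future = list(lookahead)[:k]               # bounded K-token lookahead
    baseline = cont_logp(list(left), future)   # no-insertion hypothesis
    best_mark, best_margin = "", float("-inf")
    for mark in MARKS:
        with_mark = list(left) + [mark]
        # Score of the mark itself plus the alpha-weighted gain on the lookahead.
        margin = mark_logp(list(left), mark) + alpha * (
            cont_logp(with_mark, future) - baseline
        )
        if margin > best_margin:
            best_mark, best_margin = mark, margin
    # The threshold tau decides whether any insertion is worth making.
    return best_mark if best_margin > tau else ""

# Toy scorers (made up for the demo): a period is likely, and the
# lookahead reads better after a period.
def toy_mark_logp(left, mark):
    return -1.0 if mark == "." else -5.0

def toy_cont_logp(context, future):
    return 0.0 if context and context[-1] == "." else -2.0

print(decide_boundary(["hello"], ["World", "is"], toy_mark_logp, toy_cont_logp,
                      alpha=1.0, tau=0.0))
```

Because the function only scores a fixed set of hypotheses, the input words are never rewritten, which matches the abstract's point about preserving the transcript.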

2606.05177 2026-06-05 cs.CL cs.AI eess.AS

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

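Affiliations: Monash University; Defence Science and Technology Group

AI summary: To address the limitation that existing multimodal safety benchmarks handle only visual input, the paper proposes MCBench, a benchmark of 1196 test scenarios across four safety categories that requires integrating information from multiple modalities for safety assessment, revealing the shortcomings of current Omni LLMs in cross-modal safety reasoning.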

Abstract

Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2606.05176 2026-06-05 cs.CL cs.AI

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

Lucas Tamic, Ilan Jaffeux-Cheniout, Xavier Marjou

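Affiliations: Orange

AI summary: A systematic comparison of parameter-efficient fine-tuning with different LoRA configurations on Qwen2.5-3B, combined with energy consumption analysis and an LLM-as-a-judge framework. It finds that the configuration with the lowest validation loss does not necessarily obtain the best qualitative ranking, and it proposes a combinatorial synthetic data generation method.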

Abstract

While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.
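
To make the cost of varying rank and target modules concrete, here is a small sketch of how many trainable parameters one LoRA adapter adds. It relies only on the standard LoRA factorization (a d_out x r matrix times an r x d_in matrix); the layer size and the ranks in the loop are hypothetical and are not taken from the paper or from Qwen2.5-3B.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

def frozen_params(d_out: int, d_in: int) -> int:
    """Parameters of the frozen weight matrix the adapter is attached to."""
    return d_out * d_in

# Hypothetical square projection layer, used only for illustration.
D = 2048
for rank in (4, 8, 16, 32):
    added = lora_trainable_params(D, D, rank)
    share = added / frozen_params(D, D)
    print(f"rank={rank:>2}  added={added:>7}  share of layer={share:.2%}")
```

Each additional target module contributes its own adapter, so the total grows with both the rank and the number of modules selected.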

2606.05175 2026-06-05 cs.CL

Generic Triple-Latent Compression with Gated Associative Retrieval

Liu Xiao

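Affiliations: Institute of Informatics, University of Science and Technology of China

AI summary: Proposes a generic triple-latent sequence model that captures higher-order token interactions by maintaining a running token state and a compressed pair-memory path, with no benchmark-specific parsing. It improves on small Transformer baselines on byte-level WikiText-2 and the tokenizer-based MiniMind language-model benchmark; a recall-oriented gated key-value retrieval extension improves associative recall but suffers from seed sensitivity and speed issues.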

2606.05174 2026-06-05 cs.CL cs.AI

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad

Affiliations: School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USA; Intelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering (INQUIRE) Laboratory, University of Oklahoma, Norman, OK, USA; Khiabani Data Science and Analytics Institute, University of Oklahoma, Norman, OK, USA; Data Institute for Societal Challenges (DISC), University of Oklahoma, Norman, OK, USA; School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK, USA; Office of Responsible Artificial Intelligence (ORAI), University of Arizona, Tucson, AZ, USA

AI summary: Proposes a variance-aware reward framework that combines GRPO with rubrics from RaR-Medicine, replacing discrete aggregation with continuous analytical reward functions to improve LLM accuracy and F1 on heart-focused medical question answering.

Comments: 27 Pages

Abstract

Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
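
The abstract does not give the reward formulas, so the sketch below shows only the standard group-relative normalization that GRPO applies to the rewards of one sampled group. It illustrates one possible reason coarse rewards are a problem: when every response in a group receives the same score, all advantages are zero and the group carries no learning signal. The function name, the use of the population standard deviation, and the epsilon are choices made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group scored with a coarse binary rubric: every response "passes".
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))

# The same group scored on a continuous scale keeps a usable ranking signal.
print(group_relative_advantages([0.62, 0.81, 0.74, 0.90]))
```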

2606.05173 2026-06-05 cs.CL cs.AI

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

Aimen Boukhari

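Affiliations: École Nationale Supérieure d’Informatique (ESI)

AI summary: Proposes a hybrid pre-training objective that combines a JEPA latent-space prediction loss with the MLM objective, balanced by a learnable scalar. Analysis on GLUE benchmarks shows that the hybrid encoder produces more uniform embeddings and richer spectral geometry, with a better semantic-to-lexical balance.

Comments: 12 pages, 10 figures, 11 tables. Preprint. Code available at: https://github.com/aymen-000/predict-reconstruct-language-models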

Abstract

Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.

2606.05170 2026-06-05 cs.LG

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

Jason Z Wang

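Affiliations: Independent

AI summary: Introduces the Errorquake-10k benchmark, which uses continuous severity scoring and a Gutenberg-Richter index b to show that open-weight LLMs with the same accuracy differ markedly in their error severity distributions, demonstrating that the severity distribution is not redundant with the error rate.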

Comments: 28 pages, 12 figures, appendix and checklist

Abstract

At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon). A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) -- and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.
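
As a rough illustration of what a Gutenberg-Richter slope with a bootstrap interval involves, the sketch below uses the classical Aki maximum-likelihood estimator from seismology, b = log10(e) / (mean(s) - s_min), applied to severities at or above a cutoff. The abstract does not say which estimator or cutoff the paper uses, so both are assumptions here, as are the function names. The pair count is the one figure that can be checked directly: 21 models give 210 unordered pairs.

```python
import math
import random

N_PAIRS = math.comb(21, 2)  # 21 models -> 210 unordered model pairs

def b_value(severities, s_min):
    """Aki-style maximum-likelihood slope of the upper tail (assumed estimator)."""
    tail = [s for s in severities if s >= s_min]
    if not tail:
        raise ValueError("no severities at or above s_min")
    excess = sum(tail) / len(tail) - s_min
    if excess <= 0:
        raise ValueError("tail has no spread above s_min")
    return math.log10(math.e) / excess

def bootstrap_ci(severities, s_min, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the slope, resampling tail scores."""
    rng = random.Random(seed)
    tail = [s for s in severities if s >= s_min]
    stats = sorted(
        b_value([rng.choice(tail) for _ in tail], s_min) for _ in range(n_boot)
    )
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

Two models would count as distinguishable in the abstract's sense when their intervals do not overlap, which is why the interval and not only the point estimate matters.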

2606.05169 2026-06-05 cs.LG

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

Jason Z Wang

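Affiliations: Independent

AI summary: Proposes a stereological theory that quantifies blind spots in LLM benchmark coverage via effective dimensionality and the visible Hausdorff distance, finds that structural blind spots far exceed statistical noise, and derives a stable core benchmark set using a submodular greedy algorithm and eigenvalue analysis.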

Comments: 55 pages, 3 figures, 3 tables, extensive appendix with proofs

Abstract

We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by epsilon + C R m^(-1/(d_eff-1)), with matching Lipschitz lower bound. Empirically, three independent leaderboards (Open LLM v2, an extended 12-benchmark suite, LiveBench) all have d_eff in [2.86, 4.80] on their competitive frontier; the structural blind spot exceeds the observed runner-up score gap by two orders of magnitude and dominates statistical noise by 52-127x. Under a chi-squared projection model, the isotropic prior is the optimistic case; across six hidden-capability priors and four ambient dimensions the simulated half-split swap rate of the top two models stays in [0.38, 0.49], and a 500-trial random visible/held-out split shows that 92% of trials swap the top-1 ranking with on average 2.83 of 5 top-5 models changing. A submodular greedy algorithm with the Nemhauser (1 - 1/e) guarantee finds a stable core of 4 benchmarks; 7 of 12 suffice for 90% coverage, and the trained subset transfers across temporal quarters with 93-97% retention. A counterfactual validation across 12 internal benchmarks and 27 Chatbot Arena categories confirms that the eigenstructure predicts which evaluations are irreplaceable (rho = -0.69, p = 0.013 for removal disruption) and which external evaluations bring new information (rho = +0.38). As a second, independent theoretical contribution, we resolve Gardner's Problem 1.5 (1995) for C^2 support functions, establishing the minimax rate Theta(R/(kappa m^(2/(D-1)))) in general dimension via optimal recovery theory on S^(D-1).
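
The greedy selection step mentioned in this abstract can be illustrated with plain set coverage, a standard monotone submodular objective for which greedy selection carries the Nemhauser (1 - 1/e) guarantee. The paper's actual coverage objective is not stated in the abstract, so the objective, the benchmark names, and the "capabilities" they cover below are all hypothetical.

```python
from typing import Dict, List, Set

def greedy_cover(candidates: Dict[str, Set[int]], budget: int) -> List[str]:
    """Pick up to `budget` candidates, each time taking the largest marginal gain."""
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(budget):
        remaining = [name for name in candidates if name not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda name: len(candidates[name] - covered))
        if not candidates[best] - covered:
            break  # nothing left to gain, so stop early
        chosen.append(best)
        covered |= candidates[best]
    return chosen

# Hypothetical benchmarks mapped to the capability ids they exercise.
suite = {
    "bench_a": {1, 2, 3},
    "bench_b": {3, 4},
    "bench_c": {4, 5, 6, 7},
    "bench_d": {1},
}
print(greedy_cover(suite, budget=3))
```

In this toy suite two benchmarks already cover every capability, so the third slot goes unused, which mirrors the idea of a small stable core.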

2606.05168 2026-06-05 cs.CL cs.AI cs.LG

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Xiangyu Wang

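Affiliations: Xiangyu Wang

AI summary: Proposes a bilayer coupled SIR/SIRS framework that treats data corpora and AI models as two interacting populations, models model collapse caused by synthetic data contamination through cross-layer transmission, and derives the basic reproduction number R0; experiments validate the threshold dynamics and the effectiveness of intervention strategies.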

Comments: 24 pages, 15 figures

Abstract

Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{β_D β_M / [(γ_D+μ_D)(γ_M+μ_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
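
The reproduction number in this abstract is simple to evaluate numerically. The sketch below implements the formula exactly as written, with D and M denoting the data and model layers. The abstract does not define the individual rates, so reading β as cross-layer transmission and γ and μ as removal rates is an assumption, and the numbers are made up.

```python
import math

def bilayer_r0(beta_d, beta_m, gamma_d, gamma_m, mu_d, mu_m):
    """R0 = sqrt(beta_D * beta_M / ((gamma_D + mu_D) * (gamma_M + mu_M)))."""
    return math.sqrt(beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m)))

# Made-up rates: contamination spreads (supercritical) when R0 > 1.
r0 = bilayer_r0(beta_d=1.0, beta_m=1.0, gamma_d=0.25, gamma_m=0.25,
                mu_d=0.25, mu_m=0.25)
print(r0, "supercritical" if r0 > 1 else "subcritical")

# Raising the removal rate in one layer lowers R0. If detection-based
# filtering acts through such a rate, this is the lever it pulls.
print(bilayer_r0(1.0, 1.0, gamma_d=3.75, gamma_m=0.25, mu_d=0.25, mu_m=0.25))
```

Because R0 is a geometric mean across the two layers, a large enough removal rate in either layer can push the system below the threshold.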

2605.04733 2026-06-05 cs.AI

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Jun Wang, Zengxin Han, Jingtong Wu, Yaduan Ruan

Affiliations: Nanjing University; Shanghai Jiao Tong University; University of California, San Diego; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Information Engineering, Beijing Institute of Graphic Communication; Ant International, Ant Group; Independent Researcher

AI summary: Proposes the EBM-RL framework, which uses reward-decomposed reinforcement learning to optimize visual perception, reasoning, and generation in video role-playing, improving scene consistency and character authenticity.

Abstract

Text-based role-playing models can imitate character styles, but often fail to capture scene atmosphere and evolving tension, which are crucial for immersive applications such as VR games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that separates observation (<perception>), reasoning (<think>), and utterance generation (<answer>). This design mimics the human See-Think-Speak process, enabling the model to ground dialogue in visual perception before reasoning and response generation. To optimize this See-Think-Speak process, EBM-RL integrates complementary rewards for scene-text alignment, perceptual-cognitive utility, answer faithfulness, and format consistency. Extensive experiments show that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, improving both visual-atmosphere consistency and character authenticity. Moreover, EBM-RL demonstrates strong zero-shot transfer to out-of-domain VideoQA benchmarks without additional fine-tuning. We also release an open-source dataset for video-grounded role-playing dialogue.

2606.05104 2026-06-05 cs.AI

Knowledge Index of Noah's Ark

Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen Yan, Wenhao Huang, Jiaheng Liu, Zihan Wang, Weihao Xuan, Ge Zhang

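Affiliations: M-A-P; Carnegie Mellon University; Brown University; Waseda University; The University of Tokyo; Massachusetts Institute of Technology; University of Arizona; Northwestern University; Duke-NUS Medical School

AI summary: To address problems of representativeness, annotation quality, and ranking stability in LLM knowledge benchmarks, the paper proposes the KINA benchmark, achieves disciplinary representativeness through a greedy approximation, and proves that a bonus tournament mechanism dominates flat payment; experiments show that top models remain far from saturation.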

Abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
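
One simple way to quantify ranking stability under a bounded test budget is to resample the benchmark items with replacement and count how often the leader is unchanged. The abstract does not specify which bootstrap statistics the paper reports, so the statistic below (top-1 retention) and all names are assumptions made for illustration.

```python
import random
from typing import Dict, List

def top1_retention(correct: Dict[str, List[int]], n_boot: int = 1000,
                   seed: int = 0) -> float:
    """Fraction of item resamples in which the full-data leader stays on top.

    `correct[m][i]` is 1 if model m answered item i correctly, else 0.
    """
    rng = random.Random(seed)
    models = list(correct)
    n_items = len(correct[models[0]])

    def leader(indices):
        # Ties go to the model listed first.
        return max(models, key=lambda m: sum(correct[m][i] for i in indices))

    full_leader = leader(range(n_items))
    kept = 0
    for _ in range(n_boot):
        sample = [rng.randrange(n_items) for _ in range(n_items)]
        kept += leader(sample) == full_leader
    return kept / n_boot
```

A retention value well below 1 for two adjacent models is the kind of evidence that would argue against reading much into their relative rank.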

2606.04811 2026-06-05 cs.CV

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou

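Affiliations: Show Lab, National University of Singapore; University of Oxford; Tencent

AI summary: Proposes the Dream.exe evaluation framework, which uses a video-to-execution pipeline to test whether motion produced by video generation models can be turned into executable robot manipulation, and finds that visual quality does not predict executability.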

Abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.

2606.04777 2026-06-05 cs.LG

UniFair: A unified fair clustering approach based on separation and compactness

Antonia Karra, Vasiliki Papanikou, Georgios Vardakas, Evaggelia Pitoura, Aristidis Likas

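Affiliations: University of Ioannina; Archimedes/Athena Research Center

AI summary: Proposes the UniFair framework, which jointly optimizes separation fairness and social fairness to reduce group disparities while preserving clustering quality.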

Comments: 17 pages, 6 Figures

Abstract

Clustering is increasingly used to support high-impact decisions, yet standard objectives such as k-means can produce clusterings that treat demographic groups unequally. Existing fair clustering methods typically optimize a single notion of fairness and often overlook how clustering costs interact with the geometry of the induced decision boundaries. We propose UniFair, a unified framework that jointly optimizes separation fairness and social fairness. Separation fairness encourages protected groups to lie farther from the induced decision boundaries, while social fairness reduces disparities in within-cluster distortion by penalizing group-wise clustering costs. We develop gradient-based optimization procedures for separation-fair and unified k-means objectives, and extend them to deep clustering by enforcing the same criteria in the latent space of an autoencoder. Experiments on tabular and image datasets show that UniFair reduces both boundary-related and cost-based group disparities with only a modest increase in clustering loss.
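
The group-wise clustering cost that the social fairness term penalizes can be made concrete with a few lines of code. The sketch below computes, for a fixed set of centroids, each group's average squared distance to its nearest centroid and the gap between the worst-off and best-off group. The abstract does not give the exact penalty, so measuring disparity as a max-minus-min gap is an assumption, and the data are made up.

```python
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]

def group_costs(points: List[Point], groups: List[Hashable],
                centroids: List[Point]) -> Dict[Hashable, float]:
    """Average squared distance to the nearest centroid, per demographic group."""
    totals: Dict[Hashable, float] = {}
    counts: Dict[Hashable, int] = {}
    for point, group in zip(points, groups):
        nearest = min(
            sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids
        )
        totals[group] = totals.get(group, 0.0) + nearest
        counts[group] = counts.get(group, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

def cost_disparity(costs: Dict[Hashable, float]) -> float:
    """Gap between the worst-off and best-off group (assumed disparity measure)."""
    return max(costs.values()) - min(costs.values())

# Made-up data: group "a" sits farther from its centroid than group "b".
pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
grp = ["a", "a", "b"]
ctr = [(1.0, 0.0), (10.0, 0.0)]
costs = group_costs(pts, grp, ctr)
print(costs, cost_disparity(costs))
```

A standard k-means objective sums these distances over all points regardless of group, which is how one group can end up with a much larger share of the distortion.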

2606.04708 2026-06-05 cs.RO cs.AI

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li

Affiliations: Institute of AI (TeleAI), China Telecom; Lumos Robotics; University of Science and Technology of China; Northwestern Polytechnical University; Shanghai Jiao Tong University; East China University of Science and Technology; Harbin Engineering University; Fudan University

AI summary: Proposes the VISTA framework, which aligns visual representations with the UMI-VQA dataset, filters feasible trajectories with a physical-validation pipeline, and applies two-stage co-training to address visual distribution shift and physically infeasible actions when training VLA models on UMI data.

Comments: Corrected the typing error

英文摘要

Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address the challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments empirically show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $π_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.

2606.04672 2026-06-05 cs.LG cs.AI

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

利用状态空间模型学习连续时间动态图中的长程时空表示

Ayushman Raghuvanshi, Thummaluru Siddartha Reddy, Sundeep Prabhakar Chepuri, Mahesh Chandran

发表机构 * University of Texas at Austin（德克萨斯大学奥斯汀分校）； Indian Institute of Science（印度科学研究院）

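AI summary: Proposes CTDG-SSM, a state-space-model framework for continuous-time dynamic graphs that propagates long-range spatio-temporal information through a topology-aware higher-order polynomial projection operator (CTT-HiPPO), reaching state-of-the-art performance on dynamic link prediction, dynamic node classification, and sequence classification.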

详情

Comments: Accepted at ICML 2026

AI中文摘要

连续时间动态图（CTDG）为捕捉演化关系数据中的细粒度时间模式提供了更丰富的框架。长程信息传播是学习表示时的关键挑战，其中需要在长时间跨度上保留和更新信息。现有方法限制模型捕捉一跳或局部时间邻域，无法捕捉多跳或全局结构模式。为解决此问题，我们从第一性原理推导出一个参数高效的连续时间动态图状态空间建模框架（CTDG-SSM）。我们首先引入连续时间拓扑感知高阶多项式投影算子（CTT-HiPPO），这是一种基于记忆的HiPPO新公式，用于联合编码时间动态和图结构。CTT-HiPPO的解通过将经典HiPPO解投影到拉普拉斯矩阵的多项式上获得，产生拓扑感知的记忆更新，该更新等价于CTDG的状态空间公式（CTDG-SSM）。然后，使用零阶保持方法获得计算高效的离散公式用于模型实现。在动态链接预测、动态节点分类和序列分类的基准测试中，CTDG-SSM实现了最先进的性能。值得注意的是，在需要长程时间（LRT）和空间推理的数据集上，它取得了较大的性能提升。

英文摘要

Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO to jointly encode temporal dynamics and graph structure. The solution from CTT-HiPPO is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation. Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, CTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.
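
The zero-order hold step mentioned above can be sketched for a scalar state equation. This is only the textbook discretization of x'(t) = a·x(t) + b·u(t), assuming a nonzero scalar `a`; it does not reproduce the CTT-HiPPO operator or the Laplacian polynomial, and the function names are hypothetical.

```python
import math

def zoh_discretize(a, b, dt):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t), a != 0.

    Returns (a_d, b_d) such that x(t + dt) = a_d * x(t) + b_d * u
    when the input u is held constant over the interval of length dt.
    """
    a_d = math.exp(a * dt)
    b_d = (a_d - 1.0) / a * b
    return a_d, b_d

def simulate(a, b, steps, x0=0.0):
    """steps: list of (dt, u) pairs; the gaps dt may be irregular."""
    x = x0
    for dt, u in steps:
        a_d, b_d = zoh_discretize(a, b, dt)
        x = a_d * x + b_d * u
    return x

# Irregular gaps, as in event streams on a continuous-time graph.
final_state = simulate(-1.0, 1.0, [(0.1, 1.0), (2.5, 0.0), (0.3, 1.0)])
```

Because the discrete coefficients depend on each gap `dt`, the same continuous-time parameters handle irregularly spaced events without resampling.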

2606.04560 2026-06-05 cs.LG cs.AI

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

基于轨迹级别优势优先经验回放的GRPO

Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon

发表机构 * Department of Electrical and Computer Engineering, Seoul National University（首尔国立大学电子与计算机工程系）； Interdisciplinary Program in AI, Seoul National University（首尔国立大学人工智能跨学科项目）； AIIS, ASRI, INMC, and ISRC, Seoul National University（首尔国立大学人工智能研究所、人工智能研究机构、智能网络与计算中心及人工智能科学研究中心）

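AI summary: To address GRPO's low sample efficiency, proposes a rollout-level experience replay buffer that bounds staleness through age eviction, preserves on-policy data via fresh-anchored composition, and prioritizes sampling by advantage magnitude, improving performance on several math benchmarks.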

详情

AI中文摘要

基于可验证奖励的GRPO强化学习是后训练推理LLM的标准方法，但样本效率低下。每个轨迹仅用于一次梯度更新后被丢弃。朴素回放在此设置中不适用，因为LLM策略每步梯度变化快，存储的轨迹会变得陈旧并破坏训练稳定性。我们提出一种面向GRPO的轨迹级回放缓冲器，存储和采样单个轨迹而非整组。缓冲器通过年龄驱逐限制陈旧性：任何超过tau_max训练步数的轨迹被移除。缓冲器还通过新鲜锚定组合保留在线策略数据：每个批次保留其新鲜的在线策略轨迹，并拼接从缓冲器中单独抽取的回放轨迹。我们按每个轨迹的优势幅度进行优先回放，并回收优势大的单个轨迹。在三个Qwen3-Base规模、五个数学基准上，我们的方法优于GRPO和朴素回放基线。所有规模均获得正向增益，且随模型增大而增长。最大增益在4B规模上，五个基准平均提升+4.35个百分点。在联合衡量准确率和token效率的AES指标下，与GRPO的效率差距同样在4B最大，为+0.579。

英文摘要

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
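
The three buffer mechanisms described above (age eviction, fresh-anchored composition, advantage-prioritized replay) can be sketched as follows. The class and function names are hypothetical, and sampling with probability proportional to |advantage| is an assumption; the abstract only says replay is prioritized by advantage magnitude.

```python
import random

class RolloutReplayBuffer:
    """Stores individual rollouts, not whole GRPO groups."""

    def __init__(self, tau_max):
        self.tau_max = tau_max
        self.items = []  # each item: (step_created, advantage, rollout)

    def add(self, step, advantage, rollout):
        self.items.append((step, advantage, rollout))

    def evict(self, current_step):
        """Age eviction: drop rollouts older than tau_max training steps."""
        self.items = [it for it in self.items
                      if current_step - it[0] <= self.tau_max]

    def sample(self, k, rng):
        """Draw up to k distinct rollouts, weighted by |advantage| (assumed)."""
        pool = list(self.items)
        chosen = []
        while pool and len(chosen) < k:
            weights = [abs(adv) + 1e-8 for _, adv, _ in pool]
            idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(idx)[2])
        return chosen

def compose_batch(fresh_rollouts, buffer, n_replay, current_step, rng):
    """Fresh-anchored composition: fresh rollouts first, replay appended."""
    buffer.evict(current_step)
    return list(fresh_rollouts) + buffer.sample(n_replay, rng)

buf = RolloutReplayBuffer(tau_max=2)
buf.add(0, 0.9, "old")    # too old at step 5, will be evicted
buf.add(3, 0.5, "r1")
buf.add(4, -1.2, "r2")
batch = compose_batch(["f1", "f2"], buf, 2, current_step=5, rng=random.Random(0))
```

Keeping the fresh rollouts untouched at the front of the batch means every update still contains on-policy data, regardless of what the buffer returns.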

2606.04485 2026-06-05 cs.LG

LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models

LimiX-2M：缓解表格基础模型中的低秩坍塌和注意力瓶颈

Yuanrui Wang, Xingxuan Zhang, Han Yu, Mingchao Hao, Gang Ren, Hao Yuan, Li Mao, Yunjia Zhang, Chun Yuan, Peng Cui

发表机构 * University of Science and Technology of China（中国科学技术大学）

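AI summary: Proposes LimiX-2M, built on a unified tokenize-and-route framework: RaBEL expands each scalar into localized RBF features, and the bidirectional block is reordered as S→N→F. With 2M parameters it outperforms larger models and improves the accuracy-efficiency trade-off of tabular foundation models.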

详情

Comments: Accepted to ICML 2026

AI中文摘要

表格基础模型（TFM）日益与树集成方法竞争，但其性能通常计算效率低下：使用标准仿射标量分词时，每个特征通过本质上的一维通道注入值变化，特征ID/位置信号无法增加特征内值的自由度，导致早期层值敏感性弱和隐藏状态冗余。我们提出了一个统一的tokenize-and-route框架用于强TFM：RaBEL将每个标量扩展为紧凑的局部RBF特征（可选指数门控）以改善条件和浅层有效秩，而重新排序的双向块S→N→F通过在特征混合前聚合跨样本上下文并使用注意力池化来使计算与读出对齐。这些变化共同产生了LimiX-2M，一个2M参数模型，在广泛使用的表格基准上优于更大的TabPFN-v2和TabICL基线，同时降低了训练和推理成本。这些结果突出了值感知分词和读出对齐路由作为改善TFM中精度-效率权衡的关键杠杆。模型检查点和推理代码可在https://github.com/limix-ldm-ai/LimiX获取。

英文摘要

Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, while a reordered bidirectional block S->N->F aligns computation with the readout by aggregating cross-sample context before feature mixing and using attention pooling. Together, these changes yield LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value-aware tokenization and readout-aligned routing as key levers for improving the accuracy--efficiency trade-off in TFMs. Model checkpoints and inference code are available at https://github.com/limix-ldm-ai/LimiX.
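
The contrast between affine scalar tokenization and a localized RBF expansion can be made concrete with a small sketch. It assumes Gaussian basis functions on fixed centers; the actual RaBEL parameterization (including the optional exponent gating) is not specified in the abstract, and all names and values below are hypothetical.

```python
import math

def affine_token(x, w, b):
    """Standard affine tokenization: every value lies on one line in token space."""
    return [x * wi + bi for wi, bi in zip(w, b)]

def rbf_expand(x, centers, width):
    """Localized Gaussian RBF features of a scalar (assumed form)."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

values = [-1.0, 0.0, 1.0]
affine_rows = [affine_token(x, [1.0, 2.0, 3.0], [0.5, -1.0, 0.0]) for x in values]
rbf_rows = [rbf_expand(x, [-1.0, 0.0, 1.0], 1.0) for x in values]

affine_det = det3(affine_rows)  # collinear tokens, so the matrix is singular
rbf_det = det3(rbf_rows)        # RBF tokens span all three dimensions
```

The singular affine matrix reflects the one-dimensional value channel the abstract describes, while the RBF rows are linearly independent, which is the sense in which the expansion raises shallow-layer effective rank.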

2606.04463 2026-06-05 cs.RO

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

OSCAR: 面向机器人的全具身动作条件世界模型

Zhuoyuan Wu, Jun Gao

发表机构 * Peking University（北京大学）； University of Michigan（密歇根大学）； NVIDIA（英伟达）

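AI summary: Proposes OSCAR, an action-conditioned video world model that generalizes across robot embodiments through a large-scale data pipeline and a unified 2D skeleton-rendering representation, and is used for policy evaluation.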

详情

Comments: Project page: https://wuzy2115.github.io/oscar-project-page/

AI中文摘要

我们提出OSCAR，一种精确的动作条件视频世界模型，能够泛化到不同的机器人具身并支持机器人策略评估。现有的视频世界模型在真实机器人评估中面临三个主要挑战：当前机器人训练数据集的场景多样性有限、动作跟随不精确、以及跨具身泛化能力差以支持广泛采用。我们从两个角度应对这些挑战。其核心是一个大规模标准化数据管道，用于整理、过滤和去重广泛的机器人和以自我为中心的人类数据集，产生一个涵盖多样化任务、场景、动作和机器人具身的干净联合训练数据集。为了给视频模型提供条件，我们采用2D运动学骨架渲染作为统一的条件表示，能够泛化到不同的机器人手臂甚至人类手部。我们在单个GH200 GPU上微调Cosmos-Predict2.5-2B模型。与现有基线相比，我们的模型在动作跟随、外观质量和运动一致性方面取得了显著改进，而基线要么模型规模大得多，要么需要更多GPU。我们进一步将OSCAR部署到RoboArena中评估机器人策略。大量实验表明，OSCAR中的虚拟策略评估与真实世界评估之间存在显著相关性，为未来机器人策略可以纯粹在虚拟生成的世界中评估铺平了道路。

英文摘要

We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate the significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for the future where robot policies can be purely evaluated in virtual generated worlds.

2606.04335 2026-06-05 cs.LG cs.SY eess.SY

Policy Gradient for Continuous-Time Robust Markov Decision Processes

连续时间鲁棒马尔可夫决策过程的策略梯度

Tanya Veeravalli, David M. Bossens, Atsushi Nitanda

发表机构 * Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR)（前沿人工智能研究中心，科技研究局（A*STAR））； Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)（高性能计算研究所，科技研究局（A*STAR））； College of Computing and Data Science, Nanyang Technological University（计算与数据科学学院，南洋理工大学）

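AI summary: For continuous-time robust Markov decision processes, derives policy gradients and adversarial gradients and proposes double-loop optimisers with linear convergence and mean-field optimisers with sublinear convergence, together with a sample-complexity analysis.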

详情

AI中文摘要

鲁棒马尔可夫决策过程（RMDPs）框架允许设计在最坏情况转移动态下满足性能保证的强化学习智能体。传统RMDPs考虑离散时间动态，最近，样本高效的策略梯度算法已在此背景下被研究。本文研究连续时间RMDP框架内的策略梯度算法。使用随机和常微分方程的路径和伴随公式推导策略梯度和对抗梯度。我们提出双循环优化器，在基于oracle的设置中获得线性收敛，在基于样本的设置中获得$\tilde{\mathcal{O}}(\frac{1}{ε^2})$样本复杂度，该分析还推导了无折扣总成本MDP框架的新工具。此外，我们提出平均场优化器作为分布优化器，具有$\tilde{\mathcal{O}}(\frac{1}{K})$的基于oracle的收敛速率和$N$粒子近似下$\tilde{\mathcal{O}}(\frac{N^2}ε)$的样本复杂度。通过神经常微分方程动态的连续时间RMDPs，两种优化器的连续时间策略梯度算法的有效性得到确认。

英文摘要

The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient algorithms have been considered in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers to obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{ε^2})$ sample complexity in the sample-based setting in an analysis which also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}ε)$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.

2606.04037 2026-06-05 cs.AI cs.LG cs.SE

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

面向企业AI代理的部署前保障：基于本体的仿真与信任认证

Thanh Luong Tuan, Abhijit Sanyal

发表机构 * Golden Gate University（金门大学）； Data, Digital & IT, Novartis Healthcare Pvt. Ltd.（数据、数字与IT，诺华健康护理私人有限公司）

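AI summary: Proposes an ontology-grounded verification framework that uses ontology-driven scenario generation and a Trust Certificate to automate regulatory-compliance and safety certification of enterprise AI agents before deployment.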

详情

Comments: 26 pages, 3 figures. Companion to arXiv:2604.00555. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-6

AI中文摘要

企业人工智能（AI）代理的部署前验证仍然是大型语言模型（LLM）能力基准测试与生产部署之间的关键缺口。一旦代理在生产环境中运行，部署后监控、人在回路控制和提示级护栏提供的保障有限。我们提出了一种基于本体的验证框架，包含三个组件：一个代理操作范围，形式化了跨权限、领域约束、安全属性、治理规则和自主级别的认证空间；一个本体到场景的生成流水线，自动推导出监管、操作和对抗性测试场景；以及一个信任证书，携带机器可验证的证明，并附带分级部署裁决（批准、有条件、拒绝）。在四个受监管行业（金融科技、银行、保险和医疗保健）中进行的受控试点，实例化为美国与越南的五个行业-监管体制单元，生成了1,800个场景，并针对125个主要来源监管要求和25个注入故障进行了评估。基于本体的生成（G4）实现了48.3%的监管覆盖率，而基于角色的基线为33.1%（校正后p=0.0006），并且领域特异性最高（4.77/5.0；p=2e-6）。在Bonferroni校正后，相对于基线和检索增强提示的覆盖率优势不再稳健。跨三个LLM家族（Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B；总计5,400个场景）的交叉验证复制了角色与本体模式。结果表明，对于监管密集型领域，基于本体的场景生成可作为基于角色测试套件的可信补充。

英文摘要

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

2606.04032 2026-06-05 cs.LG cs.AI cs.CL cs.PF

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Transformer 需要三个投影吗？QKV 变体的系统研究

Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis

发表机构 * Ali Kayyam ； Anusha Madan Gopal ； M Anthony Lewis

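AI summary: Systematically studies attention variants that share the query, key, and value projections, finding that Q-K=V sharing in language modeling halves the KV cache at a cost of only 3.1% perplexity degradation and, combined with head sharing, reaches a 96.9% cache reduction, supporting on-device inference.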

详情

Comments: Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

AI中文摘要

Transformer 已成为各种 AI 任务的标准解决方案，其中查询、键和值（QKV）注意力公式起着核心作用。然而，这三个投影的各自贡献以及省略某些投影的影响仍知之甚少。我们系统评估了三种投影共享约束：a) Q-K=V（共享键-值），b) Q=K-V（共享查询-键），c) Q=K=V（单投影）。后两种变体产生对称注意力图；为了解决这个问题，我们还通过二维位置编码探索了非对称注意力。通过涵盖合成任务、视觉（MNIST、CIFAR、TinyImageNet、异常检测）和语言建模（在 10B 令牌上训练的 300M 和 1.2B 参数模型）的实验，我们发现我们的 Transformer 性能与 QKV Transformer 相当，有时甚至更好。在语言建模中，Q-K=V 投影共享实现了 50% 的 KV 缓存减少，仅导致 3.1% 的困惑度下降。关键的是，投影共享与头共享（GQA/MQA）互补：将 Q-K=V 与 GQA-4 结合可实现 87.5% 的缓存减少，而 Q-K=V + MQA 则达到 96.9%，从而实现了实用的设备端推理。我们表明，Q-K=V 保持了质量，因为键和值可以占据相似的表示空间，并且注意力在低秩机制下运行，而 Q=K-V 则破坏了注意力的方向性。我们的结果系统地将投影共享描述为注意力中权重绑定的一种未被充分探索的实例，具有直接、可量化的推理内存优势，尤其对边缘部署有价值。代码公开于 https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections。

英文摘要

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections
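
The cache-reduction percentages quoted above follow from simple counting, sketched below. The sketch assumes 16 attention heads and reads GQA-4 as 4 key-value heads; both are assumptions chosen because they reproduce the reported 50%, 87.5%, and 96.9% figures, not details stated in the abstract.

```python
def kv_cache_fraction(n_heads, n_kv_heads, share_kv):
    """KV cache size relative to multi-head attention with separate K and V."""
    tensors = 1 if share_kv else 2  # K=V stores one tensor per head instead of two
    return (tensors * n_kv_heads) / (2 * n_heads)

def reduction_pct(n_heads, n_kv_heads, share_kv):
    """Cache reduction in percent relative to the standard QKV baseline."""
    return 100.0 * (1.0 - kv_cache_fraction(n_heads, n_kv_heads, share_kv))

N_HEADS = 16  # assumed head count

shared_only = reduction_pct(N_HEADS, N_HEADS, share_kv=True)   # Q-K=V alone
shared_gqa4 = reduction_pct(N_HEADS, 4, share_kv=True)         # Q-K=V + GQA-4
shared_mqa = reduction_pct(N_HEADS, 1, share_kv=True)          # Q-K=V + MQA
```

Projection sharing halves the number of cached tensors and head sharing divides the number of cached heads, so the two savings multiply: 1/2 × 1/16 = 1/32 of the baseline for Q-K=V with MQA.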

2606.03785 2026-06-05 cs.CL

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

后门遗忘泛化：走向移除大语言模型中未知触发器的路径

Lisa Bouger, Théo Lasnier, Philippe Loubet Moundi, Yannick Teglia, Djamé Seddah

发表机构 * Inria Paris（法国巴黎国家信息与自动化研究所）； Sorbonne Université（索邦大学）； Thales CDI（泰雷兹CDI）

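AI summary: Shows experimentally that unlearning a single backdoor generalizes to suppress other backdoors that were never explicitly targeted, and introduces the Cross Activation Shift Distance to quantify model changes induced by different trainings, suggesting a new direction in which controlled backdoors are used to remove unknown ones.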

详情

Comments: 22 pages, 28 figures

AI中文摘要

大语言模型中的后门攻击是一个日益严重的安全问题，模型可能生成对手选择的内容。现有防御一次只针对一个后门，并且通常需要知道触发器，这使得防御者在模型中可能存在未知后门时处于结构性劣势。我们表明，通过遗忘进行后门中和可以跨后门泛化：训练模型忽略单个触发器也可以抑制其他从未明确针对的后门。我们通过分析每次移除一个后门后获得的模型，研究了三个模型家族中的这一现象，这些模型的后门是通过预训练或持续预训练注入的。为了理解为什么遗忘某些后门会导致其他后门的抑制，我们引入了交叉激活偏移距离，以量化不同训练引起的模型变化之间的距离。我们的结果为LLM安全开辟了一个新方向，因为防御者可以故意注入受控后门然后移除它们，利用跨后门转移来抑制攻击者可能先前在模型中引入的未知后门。

英文摘要

Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, to quantify the distance between model changes induced by different trainings. Our results open a new direction for LLM safety as defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.

2606.03730 2026-06-05 cs.CV

Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models

超越虚假稳定性：面向视觉语言模型测试时对抗防御的高噪声漂移门控

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI, UAE（穆罕默德·本·扎耶德人工智能大学，阿联酋）； Khalifa University, UAE（哈利法大学，阿联酋）； Australian National University, Australia（澳大利亚国立大学，澳大利亚）

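AI summary: To address the vulnerability of vision-language models to adversarial attacks at test time, proposes a training-free, plug-in high-noise drift-gating mechanism that triggers a defense when feature instability under high noise is detected, improving the clean-robust trade-off.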

详情

AI中文摘要

视觉语言模型（如CLIP）展现出强大的零样本泛化能力，但极易受到对抗攻击。对抗训练能提升鲁棒性但计算成本高昂，因此推动了测试时防御的研究。近期方法利用CLIP视觉表示对随机扰动的响应：聚合噪声视图的预测、构建高斯噪声平均锚点并将特征向锚点插值、或应用反扰动。这些策略提升了鲁棒性，但往往降低了干净准确率，导致不利的干净-鲁棒权衡。我们重新审视随机测试时防御，并发现CLIP表示空间中一个未被充分探索的噪声区域转变。先前工作主要在弱噪声区域探索扰动，其中对抗样本可能表现出异常稳定性（虚假稳定性）。我们的分析表明，随着扰动强度增加，这种稳定性发生逆转：在弱噪声区域之外，对抗表示变得比干净表示明显更不稳定，提供了更清晰的分离信号。这种转变在均匀噪声和高斯噪声、光度变换和几何变换、不同数据集以及多种攻击下均一致。在对抗训练模型中，该转变基本消失，表明其与非鲁棒CLIP中对抗表示的脆弱局部盆地几何结构相关。我们提出一种无训练、即插即用的漂移门控机制，利用高噪声特征漂移作为轻量级门控信号，仅在检测到类似对抗的不稳定性时触发现有测试时防御。在13个数据集上，该方法一致改善了干净-鲁棒权衡。在8个细粒度数据集上，反攻击防御的平均干净+对抗准确率从65.7%提升至71.4%，噪声锚定防御从68.4%提升至73.2%；在ImageNet及其四个变体上，分别从56.1%提升至66.2%和从62.1%提升至67.6%。

英文摘要

Vision-language models (VLMs) such as CLIP show strong zero-shot generalization but remain highly vulnerable to adversarial attacks. Adversarial training improves robustness but is computationally expensive, motivating test-time defenses. Recent approaches exploit how CLIP's visual representations respond to stochastic perturbations: aggregating predictions across noisy views, constructing Gaussian noise-averaged anchors and interpolating features toward them, or applying counter-perturbations. These strategies improve robustness but often degrade clean accuracy, yielding an unfavorable clean-robust trade-off. We revisit stochastic test-time defenses and identify an underexplored noise-regime transition in CLIP's representation space. Prior work explored perturbations mainly in the weak-noise regime, where adversarial examples can appear unusually stable (false stability). Our analysis shows this reverses as perturbation strength grows: beyond the weak-noise regime, adversarial representations become markedly more unstable than clean ones, giving a clearer separation signal. The transition is consistent across uniform and Gaussian noise, photometric and geometric transforms, datasets, and diverse attacks. It largely disappears in adversarially trained models, suggesting it is tied to the fragile local-basin geometry of adversarial representations in non-robust CLIP. We propose a training-free, plug-in drift-gated mechanism that uses high-noise feature drift as a lightweight gating signal to trigger existing test-time defenses only when adversarial-like instability is detected. Across 13 datasets it consistently improves the clean-robust trade-off. On eight fine-grained datasets, mean clean+adversarial accuracy rises from 65.7% to 71.4% for counterattack defenses and 68.4% to 73.2% for noise-anchoring; on ImageNet and four shifted variants, from 56.1% to 66.2% and 62.1% to 67.6%.
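
The gating logic described above can be sketched in a few lines. The drift measure (mean cosine distance under Gaussian input noise), the threshold, and the toy `encode`, `classify`, and `defend` functions are all assumptions for illustration; the abstract does not specify how drift is computed or calibrated.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def feature_drift(encode, x, sigma, n_views, rng):
    """Mean cosine distance between clean features and noisy-view features."""
    base = encode(x)
    total = 0.0
    for _ in range(n_views):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += 1.0 - cosine(base, encode(noisy))
    return total / n_views

def gated_predict(encode, classify, defend, x, sigma, threshold, n_views, rng):
    """Run the test-time defense only when high-noise drift exceeds the threshold."""
    drift = feature_drift(encode, x, sigma, n_views, rng)
    if drift > threshold:
        return defend(x)
    return classify(encode(x))

# Toy stand-ins so the sketch runs end to end (not a real CLIP encoder).
encode = lambda x: [x[0] + x[1], x[0] - x[1] + 1.0]
classify = lambda f: "plain:" + ("pos" if f[0] > 0 else "neg")
defend = lambda x: "defended"
x = [0.5, 0.2]
```

Inputs whose features stay stable keep the undefended prediction, which is how gating avoids paying the clean-accuracy cost of the defense on every sample.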

2606.03189 2026-06-05 cs.CL

SenseJudge: Human-Centric Preference-Driven Judgment Framework

SenseJudge: 以人为中心的偏好驱动判断框架

Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University（多媒体信息处理国家重点实验室，计算机学院，北京大学）； StepFun ； Xi’an Jiaotong University（西安交通大学）

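AI summary: Proposes the SenseJudge framework and the SenseBench benchmark, which incorporate user preferences to enable personalized judgment and model ranking; experiments show it outperforms existing methods.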

详情

Comments: ACL 2026 Findings

AI中文摘要

大型语言模型（LLMs）作为判断器在评估模型响应等各种场景中正成为一种日益被接受的范式。然而，现有的判断方法通常依赖于使用固定偏好数据训练的评判器，这往往忽视了多样化的用户偏好，难以适应真实的人机对话场景。为了解决这些局限性，我们提出了SenseJudge，一个由人类偏好驱动的可定制判断框架，以及SenseBench，一个源自真实世界多轮交互的多样且具有挑战性的指令遵循基准。我们将自动判断框架和基准应用于两个任务：（1）LLMs作为个性化判断器，以及（2）模型排名。我们进行了大量实验，结果表明SenseJudge框架在LLMs作为个性化判断器任务中超越了其他判断方法和模型，并实现了与真实人类感知一致的模型排名。此外，我们对位置偏差和一致性进行了分析，并进行了消融研究，证实了SenseJudge的鲁棒性。

英文摘要

Using Large Language Models (LLMs) as judges, for example to assess model responses, is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on judges trained on fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences, and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We apply the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. Extensive experiments show that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and achieves model rankings that align with real human perception. Additionally, analyses of position bias and consistency, alongside ablation studies, confirm the robustness of SenseJudge.

2606.03100 2026-06-05 cs.CV cs.LG

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

零样本3D问答通过层级视图到令牌传输

Dongsheng Wang, Dawei Su, Hui Huang

发表机构 * Dongsheng Wang（王东生）； Dawei Su（苏大卫）； Hui Huang（黄慧）

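AI summary: Proposes KeyVT, which collects input context hierarchically at the view and token levels: it assesses view importance from pixel features combined with camera parameters and identifies representative tokens via optimal transport, improving zero-shot 3D question answering.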

详情

Comments: Accepted at ICML 2026. 19 pages, 6 figures

AI中文摘要

最近，通过2D视觉-语言模型（VLM）进行零样本3D场景理解因其有前景的空间推理能力而受到越来越多的研究关注。通常，从3D点云中采样多个2D视图，并输入预训练的VLM以回答给定问题。这种范式凸显了输入上下文质量的关键作用，并提出了在有限输入预算下尽可能保留与任务相关的3D细节的挑战。我们提出了KeyVT，一种在视图和令牌级别进行输入上下文收集的层级方法。具体来说，我们将像素特征与相机参数结合，并基于语义内容和几何位置评估视图重要性，从而得到空间一致且与任务相关的视图。此外，我们通过最优传输（OT）框架识别代表性令牌来解决选定视图中补丁之间的冗余问题，其中视图令牌和关键令牌被公式化为嵌入空间中的两个离散分布。这些关键令牌通过最小化OT距离期望覆盖所有视图特征。我们在三个广泛使用的基准上评估了我们的框架，结果表明与现有的无调优方法相比有显著改进，并且性能与基于训练的方法相当。

英文摘要

Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose KeyVT, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.

2606.03070 2026-06-05 cs.LG cs.AI

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

ASymPO: 用于异步大语言模型后训练的非对称尺度策略优化（无需行为信息）

Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Technologies（华为技术）

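AI summary: To address the distribution drift caused by stale responses in asynchronous reinforcement learning, proposes Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss to restore zero-sum balance without requiring behavior-policy probabilities.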

详情

Comments: incorrect proofs in the paper

AI中文摘要

异步强化学习通过将响应生成与策略优化解耦来提高语言模型后训练的吞吐量，但陈旧响应会引入分布漂移。标准的行为校正方法通过行为策略概率、重要性比率或裁剪来控制这种漂移，这需要在推出和学习系统之间具有令牌对齐、版本化和数值一致的行为对数概率。我们探究是否仅使用当前策略概率就能稳定异步组相对强化学习。我们识别出一种尺度不平衡失败模式：当在当前策略下评估陈旧响应时，正负损失项可能出现在不同的负对数概率尺度上，因此零和优势不再意味着平衡的损失贡献。我们提出非对称尺度策略优化（ASymPO），它通过每个响应的当前平均令牌负对数概率来归一化其令牌损失。ASymPO不需要行为策略概率，恢复了响应级别的零和平衡，并保留了非零的学习信号。我们还引入了固定负缩放基线——缩放策略优化（SPO），并在异步数学推理后训练中评估了这两种仅当前策略的目标函数。

英文摘要

Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.
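
The scale-imbalance failure mode and the proposed normalization can be illustrated numerically. This is a sketch of one reading of the abstract, in which each response's advantage-weighted mean token negative log-probability (NLL) is divided by that same mean treated as a constant; the function name and toy numbers are hypothetical, and the paper's exact objective may differ.

```python
def response_losses(token_nll, advantages, normalize):
    """Per-response loss values: advantage times mean token NLL.

    With normalize=True, each term is divided by the response's own mean
    token NLL. In an autograd framework that divisor would be detached,
    so the value is rescaled while the gradient signal remains nonzero.
    """
    losses = []
    for nll, adv in zip(token_nll, advantages):
        mean_nll = sum(nll) / len(nll)
        scale = mean_nll if normalize else 1.0
        losses.append(adv * mean_nll / scale)
    return losses

# A fresh response (low NLL) with positive advantage and a stale
# response (high NLL under the current policy) with negative advantage.
token_nll = [[0.2, 0.4], [3.0, 5.0]]
advantages = [1.0, -1.0]  # zero-sum within the group

raw = response_losses(token_nll, advantages, normalize=False)
balanced = response_losses(token_nll, advantages, normalize=True)
```

Without normalization the stale negative term dominates the sum even though the advantages cancel; after normalization the loss contributions are balanced again at the response level.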

2606.02907 2026-06-05 cs.CL cs.AI

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

线性探针检测语言模型隐藏状态中的任务格式，而非推理模式

Subramanyam Sahoo, Vinija Jain, Aman Chadha, Divya Chaudhary

发表机构 * Horizon Research（远景研究）； Meta ； Apple（苹果公司）； Northeastern University（东北大学）

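AI summary: Linear-probe experiments show that the seemingly separated reasoning modes in LLM hidden states are caused by task-format confounds (source, option count, response length) rather than genuine computational structure of reasoning.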

详情

Comments: Accepted in the 6th Workshop on Trustworthy NLP, ACL 2026

AI中文摘要

线性探针广泛用于声称大语言模型（LLM）隐藏状态对不同推理类型学习到不同表示。我们通过在经典三分法基准（LogiQA 2.0（演绎）、ARC-Challenge（归纳）和$\alpha$NLI（溯因））上探测Qwen3-14B来检验这一说法。在40层中的第32层，线性探针达到100%交叉验证准确率，且几何结构良好分离（本征维度：20.6、28.5、33.6；凸包污染$\leq$1.5%）。然而，这种分离完全由格式混淆驱动。对来源身份、选项数和响应长度进行残差化后，准确率降至随机水平。轨迹锚点相似性表明任务间推理大部分共享（42.5%一致性 vs. 33.3%随机），且随机对照因果操控（$n=20$）显示几何结构与推理模式之间无功能联系（$p=0.286$）。因此，高探针准确率反映的是任务格式而非计算结构，这促使在机制可解释性中常规性地进行格式去混淆。

英文摘要

Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $α$NLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\leq$1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
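
Residualization, the deconfounding step described above, can be sketched for a single feature and a single numeric confound. The paper residualizes source identity, option count, and response length over full hidden states; the one-dimensional least-squares version below, with hypothetical toy values, only shows the mechanism.

```python
def residualize(values, confound):
    """Remove the part of `values` that a linear fit on `confound` explains."""
    n = len(values)
    mean_c = sum(confound) / n
    mean_v = sum(values) / n
    var_c = sum((c - mean_c) ** 2 for c in confound)
    slope = sum((c - mean_c) * (v - mean_v)
                for c, v in zip(confound, values)) / var_c
    return [v - mean_v - slope * (c - mean_c)
            for v, c in zip(values, confound)]

def class_gap(values, labels):
    """Difference between the class-1 and class-0 means of a feature."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

# Toy setup: the feature merely encodes response length, and the two
# tasks happen to differ in length.
response_length = [10.0, 12.0, 30.0, 32.0]
task_label = [0, 0, 1, 1]
feature = [0.1 * length for length in response_length]

gap_before = class_gap(feature, task_label)
gap_after = class_gap(residualize(feature, response_length), task_label)
```

The feature separates the two tasks perfectly before residualization and not at all afterwards, which mirrors the drop to chance accuracy reported in the abstract.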

2606.02776 2026-06-05 cs.CL

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

Affiliations: Institute for Logic, Language and Computation, University of Amsterdam; Khoury College of Computer Sciences, Northeastern University; Center for Language and Cognition, University of Groningen

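AI summary: Studies how conversational context affects differences in LLM answers in high-stakes scenarios, finding that conversation topic is the main driver of sociodemographic disparities and that it affects answers in unpredictable ways.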

Abstract

When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

2606.02750 2026-06-05 cs.CL

On the Persistent Effects of Lexicality in Large Language Models

Hammad Rizwan, Muhammad Umair Haider, Nishant Subramani, Mona T. Diab, A. B. Siddique, Hassan Sajjad

Affiliations: Dalhousie University; University of Kentucky; Carnegie Mellon University

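AI summary: Using adversarial semantic stress tests and an information-theoretic perspective, this paper quantifies the effect of lexical overlap relative to semantic content in LLMs. It finds that lexical influence extends across model depth, with a mid-depth transitional region where lexical and semantic signals degrade simultaneously, and uses summarization and model editing as case studies to show the effect of lexical influence on downstream tasks.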

Abstract

Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to an information-theoretic perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs, using summarization and model editing as case studies.

2606.02684 2026-06-05 cs.LG cs.AI cs.CL

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

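Affiliations: THU (Tsinghua University); HKUST (Hong Kong University of Science and Technology); BIT (Beijing Institute of Technology); Meituan; ZJU (Zhejiang University)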

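AI summary: For on-policy distillation, proposes FiRe-OPD, which combines trajectory-level filtering with token-level soft reweighting for finer-grained optimization and outperforms existing methods across multiple settings.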

Abstract

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. Specifically, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
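
The two-stage idea described in this abstract (drop low-quality trajectories, then softly reweight tokens inside the ones that remain) can be sketched roughly as follows. This is not the released FiRe-OPD code: the quality threshold, the softmax weighting, the temperature, and the dictionary keys `quality`, `token_scores`, and `token_losses` are all assumptions chosen for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def filter_then_reweight(trajectories, min_quality, temperature=1.0):
    """Return (mean soft-weighted token loss, number of kept trajectories).

    Stage 1 keeps trajectories whose quality score passes a threshold.
    Stage 2 weights each kept trajectory's token losses by a softmax over
    per-token informativeness scores instead of hard-selecting tokens.
    """
    kept = [t for t in trajectories if t["quality"] >= min_quality]
    if not kept:
        return 0.0, 0
    per_trajectory = []
    for t in kept:
        weights = softmax(t["token_scores"], temperature)
        per_trajectory.append(sum(w * l for w, l in zip(weights, t["token_losses"])))
    return sum(per_trajectory) / len(per_trajectory), len(kept)


# Hypothetical rollouts: one passes the filter, one does not.
rollouts = [
    {"quality": 0.9, "token_scores": [0.0, 0.0], "token_losses": [1.0, 3.0]},
    {"quality": 0.1, "token_scores": [0.0, 5.0], "token_losses": [9.0, 9.0]},
]
loss, n_kept = filter_then_reweight(rollouts, min_quality=0.5)
print(loss, n_kept)
```

With soft weights, every token in a kept trajectory still contributes to the loss, which is the contrast with hard token selection that the abstract draws.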
