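arXivDaily daily academic digest: synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.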

2606.11235 2026-06-11 cs.LG cs.DB stat.ME New submission

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

Leonardo Pellegrina, Fabio Vandin

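Affiliations: Department of Information Engineering, University of Padova

AI summary: Proposes FewRS, a resampling-based method for assessing the statistical significance of data mining results. By deriving a new bound on the supremum deviation, it guarantees control of the false-discovery probability with only a handful of resampled datasets, which substantially improves scalability.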

Comments: Accepted to KDD 2026

Abstract

A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
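
For context, the conventional baseline that the abstract contrasts against can be sketched as a Monte Carlo permutation test. This is not FewRS itself, whose bound is not given in the abstract; the function name, the difference-in-means statistic, and the add-one estimator are illustrative assumptions.

```python
import random

def permutation_p_value(xs, ys, num_resamples=1000, seed=0):
    """Empirical p-value for the absolute difference in means between xs and ys."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    at_least_as_extreme = 0
    for _ in range(num_resamples):
        rng.shuffle(pooled)  # one resampled dataset under the null hypothesis
        left, right = pooled[:len(xs)], pooled[len(xs):]
        stat = abs(sum(left) / len(left) - sum(right) / len(right))
        if stat >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (num_resamples + 1)
```

With this estimator the smallest attainable p-value is 1 / (num_resamples + 1), which is one reason conventional procedures generate thousands of resampled datasets, each requiring a full re-run of the analysis.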

2606.11233 2026-06-11 cs.CV New submission

OSCS-SupCon: Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning for Robust Feature Disentanglement

Bin Wang, Fadi Dornaika

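Affiliations: University of the Basque Country; IKERBASQUE

AI summary: Addresses negative-sample dilution and feature-space entanglement in supervised contrastive learning with OSCS-SupCon, a framework that combines a sigmoid contrastive loss with orthogonality constraints to improve feature discriminability and generalization.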

Abstract

Supervised Contrastive Learning (SupCon) has achieved strong performance by explicitly modeling pairwise relationships among samples. However, existing SupCon-based methods suffer from two key limitations: negative-sample dilution induced by the standard InfoNCE loss, and feature-space entanglement caused by the lack of explicit constraints separating category-relevant (common) and category-irrelevant (style) features. These limitations reduce feature discriminability and generalization ability. To address these issues, we propose OSCS-SupCon (Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning), a unified framework that combines a sigmoid-based pairwise contrastive objective with explicit orthogonality constraints. Specifically, we introduce a sigmoid-based contrastive loss with two learnable parameters, temperature and bias, which adaptively modulate pairwise decision boundaries and alleviate negative-sample dilution. Furthermore, we enforce orthogonality between common and style feature subspaces via a linear projection with ReLU nonlinearity, thereby reducing feature overlap and improving disentanglement of style-irrelevant representations. Extensive experiments on six benchmark datasets demonstrate that OSCS-SupCon consistently outperforms state-of-the-art supervised contrastive learning methods across multiple backbone architectures. In particular, on the fine-grained CUB200-2011 dataset with a ResNet-18 backbone, the proposed method achieves a 3.4% improvement in classification accuracy over CS-SupCon, highlighting its robustness and generalization capability. Ablation studies further confirm the effectiveness of each component.
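
The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes a SigLIP-style pairwise form, -log sigmoid(z * (t * sim + b)) with z = +1 for same-class pairs and -1 otherwise, and a squared-dot-product orthogonality penalty; the paper's exact loss, its projection with ReLU, and all names and default values here are assumptions.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def sigmoid_pair_loss(embeddings, labels, temperature=10.0, bias=-5.0):
    """Mean sigmoid loss over all distinct pairs of L2-normalised embeddings."""
    total, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sign = 1.0 if labels[i] == labels[j] else -1.0
            total -= log_sigmoid(sign * (temperature * sim + bias))
            count += 1
    return total / count

def orthogonality_penalty(common, style):
    """Squared dot product between a common and a style feature vector."""
    return sum(a * b for a, b in zip(common, style)) ** 2
```

Each pair is scored independently here, rather than through a softmax over all negatives as in InfoNCE. In a real training loop the temperature and bias would be learnable parameters rather than fixed defaults.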

2606.11232 2026-06-11 cs.CL cs.AI New submission

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan

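Affiliations: University of Illinois Urbana-Champaign; University of Michigan; Carnegie Mellon University; The University of Tokyo

AI summary: Existing moral benchmarks only assess preferences over isolated acts. Moral Trolley Arena, a two-stage blind ELO benchmark, calibrates individual moral acts and combines them into two-act items, and finds that the moral judgments of frontier LLMs compose in a compressed rather than simply additive way.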

Abstract

Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
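
As background on the rating scheme, a standard Elo update for one pairwise comparison can be sketched as follows. The K-factor of 32 and the 400-point scale are the conventional chess defaults, assumed here for illustration; the benchmark's own settings are not stated in the abstract.

```python
def expected_score(rating_a, rating_b):
    """Probability that item A is preferred over item B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1.0 if A wins, 0.0 if B wins."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

For two items that both start at 1500, a single win for A gives elo_update(1500, 1500, 1.0) == (1516.0, 1484.0).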

2606.11231 2026-06-11 cs.CV New submission

CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection

Suhang Li, Osamu Yoshie, Yuya Ieiri

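Affiliations: Graduate School of Information, Production and Systems, Waseda University

AI summary: Proposes CFCamo, which uses paired counterfactual training and policy optimization so that a COD model reports a detection when a target is present and abstains when none is, addressing the over-detection bias caused by positive-only training.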

Comments: 10 pages, 7 figures, 5 tables. Code and data: this https URL

Abstract

Vision-language reinforcement learning has recently shown strong target-present localization for camouflaged object detection (COD). Yet localization is only one side of the decision: when the agent faces an ordinary image with no camouflaged target, will it still claim that a camouflaged object exists? Standard COD training and evaluation data are positive-only, so agents optimized under this setting can acquire an over-detect bias, a task-specific form of object hallucination that standard COD evaluation leaves unmeasured. To quantify this target-absent behavior, we construct Counterfactual COD (CF-COD), a paired benchmark that removes the camouflaged target from each held-out COD evaluation image while preserving a plausible background. CF-COD evaluates whether a model detects the target on the original image and abstains on the target-absent counterfactual, summarized by Pair Accuracy (PA). We further introduce CFCamo, a paired counterfactual framework for COD with abstention. For training, CFCamo optimizes a Qwen3-VL-4B-Instruct agent with Counterfactual Sequence Policy Optimization (CSPO), which samples paired original-counterfactual rollouts and uses a Counterfactual Paired Reward (CPR) to couple original-image detection with counterfactual abstention. On CAMO-test, CFCamo improves S_alpha by +3.7 pp over the prior RL-based COD baseline; across CF-COD, it reaches 80.0-90.8% PA. Ablations show that removing counterfactual coupling reduces PA to 1.4-5.2% despite strong target-present COD scores, showing that target-present evaluation alone does not characterize detect-or-abstain behavior. Overall, these results indicate that CFCamo improves COD agents by coupling target-present detection with target-absent abstention, rather than merely strengthening target-present localization. Code and data are available at this https URL.
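
One plausible reading of Pair Accuracy is sketched below: a pair counts as correct only when the model both detects the target on the original image and abstains on its counterfactual. The boolean inputs are an assumption; the paper may decide "detected" with a localization threshold that the abstract does not state.

```python
def pair_accuracy(pairs):
    """pairs: iterable of (detected_on_original, abstained_on_counterfactual) booleans."""
    pairs = list(pairs)
    correct = sum(1 for detected, abstained in pairs if detected and abstained)
    return correct / len(pairs)
```

Under this definition a model that always claims a target scores 0.0 regardless of how well it localizes, which matches the abstract's point that target-present scores alone do not characterize detect-or-abstain behavior.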

2606.11222 2026-06-11 cs.CL cs.IT New submission

A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries

Dmitriy Kompaneets

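Affiliations: Independent Researcher

AI summary: Proposes a geometric framework that measures a text's semantic content from the structure of its sentence embeddings using three coordinates (novelty, breadth, integration), and proves that no scalar summary can simultaneously satisfy analytic stability, ordinal robustness, and cross-representation comparability.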

Comments: 19 pages. Code and data: this https URL

Abstract

How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.
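
The gap between 25 point-estimate passes and 21 passes after correction comes from the Benjamini-Hochberg step-up procedure, which can be sketched as follows. The function name and the alpha level of 0.05 are assumptions; the abstract does not state the level used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p-value sits under the step-up threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected
```

For p-values [0.01, 0.04, 0.03, 0.20], three fall below an uncorrected 0.05 level but only the first survives the correction.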

2606.11219 2026-06-11 cs.CL cs.AI cs.SD New submission

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Chibuzor Okocha, Christan Grant

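Affiliations: University of Florida

AI summary: Introduces five semantic and paralinguistic reasoning tasks (entailment, consistency, plausibility, accent drift, accent restraint) to evaluate how audio language models reason under accent variation, domain shift, and semantic over-inference, exposing limitations of current evaluations.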

Comments: Accepted to ACL

Abstract

Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, text-to-audio retrieval, captioning, and question-answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.

2606.11213 2026-06-11 cs.CL New submission

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

Andrew Semenov, Svyatoslav Dorofeev

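Affiliations: Kiz8

AI summary: Proposes Context Window Lifecycle (CWL), which uses structured, semantically aware eviction to give long-horizon LLM agents an effectively unbounded working horizon within a fixed context budget, avoiding performance degradation and hallucination.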

Abstract

We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.
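
A deterministic, LLM-free eviction pass of the kind described might look like the sketch below. The Episode fields, the three episode kinds, and the priority order (persisted action episodes first, oldest first, user turns never) are assumptions read off the abstract's wording; the dependency graph is omitted for brevity, so this is a simplification and not the paper's policy.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    id: int                  # position in the trajectory (lower = older)
    kind: str                # "user", "exploration", or "action" (hypothetical types)
    tokens: int
    persisted: bool = False  # True if the effects already live in the environment

def evict(episodes, budget):
    """Return the episodes kept after priority-ordered eviction down to the budget."""
    kept = list(episodes)
    # Lower sort key = evicted earlier; user turns are never candidates.
    candidates = sorted(
        (e for e in kept if e.kind != "user"),
        key=lambda e: (0 if (e.kind == "action" and e.persisted) else 1, e.id),
    )
    total = sum(e.tokens for e in kept)
    for episode in candidates:
        if total <= budget:
            break
        kept.remove(episode)
        total -= episode.tokens
    return kept
```

Because the ordering depends only on episode metadata, the same history and budget always produce the same retained context, unlike summarization-based compaction.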

2606.11212 2026-06-11 cs.CL New submission

EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

Jaspreet Singh Nahal

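Affiliations: Dr. A.P.J. Abdul Kalam Technical University

AI summary: Proposes a confidence-gated routing mechanism that chooses between retrieval and generation paths with a joint policy, so that 85% of queries are resolved by fast RAG extraction, cutting latency by more than 120x on those queries while maintaining answer quality.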

Comments: 12 pages, 10 figures, 6 tables. Code and evaluation scripts available at: this https URL. This paper studies routing strategies for hybrid GPT-RAG systems under resource constraints, focusing on efficiency-safety tradeoffs rather than state-of-the-art accuracy

Abstract

Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
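
The routing idea and the latency arithmetic can be sketched as follows. The two thresholds and the function names are hypothetical; only the 45 ms and 5.9 s path latencies and the 85% fast-path share come from the abstract.

```python
def route(retrieval_distance, extraction_adequacy, max_distance=0.5, min_adequacy=0.6):
    """Return "rag" when both confidence checks pass, otherwise "gpt"."""
    if retrieval_distance <= max_distance and extraction_adequacy >= min_adequacy:
        return "rag"
    return "gpt"

def mean_latency_ms(rag_fraction, rag_ms=45.0, gpt_ms=5900.0):
    """Expected per-query latency when rag_fraction of queries take the fast path."""
    return rag_fraction * rag_ms + (1.0 - rag_fraction) * gpt_ms
```

With 85% of queries on the fast path, the expected latency is 0.85 * 45 + 0.15 * 5900, or about 923 ms. That is roughly 6.4 times lower than sending every query down a 5.9 s path, in the same range as the reported 6.3x mean reduction, although the abstract does not say which baseline that figure uses. The per-query ratio 5900 / 45 is about 131, consistent with the "over 120x" claim.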

2606.11211 2026-06-11 cs.CL cs.AI cs.LG New submission

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

Prakul Sunil Hiremath, Harshit R. Hiremath

Affiliations: Department of Computer Science and Engineering, Visvesvaraya Technological University, Belagavi; Department of Computer Science and Business System, SG Balekundri Institute of Technology, Belagavi

AI summary: Finds that raising the chain-of-thought reasoning budget beyond a task-specific threshold can make models overconfident in incorrect answers; names the phenomenon calibration drift and introduces the CABStop stopping rule.

Comments: 31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available

Abstract

The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
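
Expected Calibration Error is commonly computed with equal-width confidence bins, as in the sketch below. The bin count of 10 is a conventional default assumed here; the abstract does not state the paper's binning.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

Computing this quantity separately at each reasoning budget B gives the ECE(B) curve whose non-monotonic shape the abstract describes. A model that reports 0.9 confidence on four answers and gets two right has an ECE of about 0.4.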

2606.11210 2026-06-11 cs.CL cs.AI cs.MM New submission

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

John Kos, Rudra Singh, Ashok Goel

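Affiliations: Georgia Institute of Technology

AI summary: Proposes the T2MM architecture, which uses an LLM to generate interactive models inside the ecological modeling software VERA and outperforms a full-code-generation baseline.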

Comments: 16 pages, 4 figures

Abstract

Model construction is a foundational practice in science learning that relies on visualization and interactivity. Large language models (LLMs), increasingly augmented with multimodal capabilities, have been integrated into educational contexts to support learning. However, these tools lack the visual interactivity that some learning contexts require. We introduce Text to Multimodal Model (T2MM), a robust, dynamic, LLM-supported architecture that assists in model construction within the open-inquiry, ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.

2606.11209 2026-06-11 cs.CL cs.AI cs.LG New submission

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

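Affiliations: LMU Munich; Harvard University; University of Cambridge; Mina AI; Konrad Zuse School of Excellence in Reliable AI (relAI)

AI summary: Proposes ProcessThinker, a post-training method that needs no explicit process reward model: a step-tagged format and a rollout-based process reward supply dense step-level rewards for multi-step reasoning and improve the consistency of multimodal reasoning.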

Comments: Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure

Abstract

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.
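
The rollout-based step reward described above can be sketched as an empirical success rate. The callables continue_fn and verify_fn are hypothetical stand-ins for the policy's sampler and the final-answer checker, and the rollout count of 8 is an assumed default.

```python
import random

def rollout_step_reward(prefix_steps, continue_fn, verify_fn, num_rollouts=8, seed=0):
    """Fraction of continuations sampled from a reasoning prefix that end in a verified answer."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(num_rollouts):
        final_answer = continue_fn(prefix_steps, rng)  # hypothetical sampler
        if verify_fn(final_answer):                    # hypothetical verifier
            successes += 1
    return successes / num_rollouts
```

A step from which most continuations still reach the correct answer earns a reward near 1, while a step that derails the trajectory earns a reward near 0. This yields a per-step signal without a trained process reward model, at the cost of num_rollouts extra generations per step.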

2606.11208 2026-06-11 cs.CL cs.AI New submission

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

Elias Hossain, Sanjeda Sara Jennifer, Sabera Akter Bushra, Niloofar Yousefi

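Affiliations: College of Engineering and Computer Science, University of Central Florida; Burnett School of Biomedical Sciences, University of Central Florida

AI summary: Proposes the BioDivergence framework, with a six-class conflict taxonomy, a 13-axis divergence ontology, and structured outputs, to capture context-dependent differences between biomedical studies that existing NLI benchmarks miss, and releases a benchmark of 11,865 claim pairs.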

Abstract

Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.

2606.11207 2026-06-11 cs.AI cs.CL New submission

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

Liu hung ming

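Affiliations: PARRAWA AI

AI summary: Proposes the SemantiClean framework, which extracts structured semantic signals from e-commerce session data through a shared element library and drives pluggable inference targets, prioritizing auditability and reproducibility over raw accuracy.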

Comments: 20 pages, 9 tables

Abstract

We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start handling. This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.

2606.11206 2026-06-11 cs.CL cs.LG 新提交

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

兼容性感知的动态微调用于大型语言模型

Yucheng Zhou, Junwei Sheng, Qianning Wang, Jianbing Shen

发表机构 * SKL-IOTSC, CIS, University of Macau（澳门大学科技学院电脑与信息科学系及智慧城市物联网国家重点实验室）； Auckland University of Technology（奥克兰理工大学）

AI总结提出兼容性感知动态微调（CADFT），通过模型似然度动态调整监督更新，抑制不兼容样本的高方差梯度，提升训练稳定性和泛化能力。

Comments: ACL 2026

AI中文摘要

监督微调（SFT）是对齐大型语言模型（LLMs）的主要范式，但它存在优化不稳定和泛化能力有限的问题。最近的研究将这一问题归因于病态的梯度缩放，并提出了动态微调（DFT）来在令牌级别进行修正。然而，DFT假设所有演示都是同样合适的学习目标，这一假设被大规模指令数据的强异质性所违反，其中演示-策略不匹配会在样本级别导致高方差更新。我们引入了兼容性感知动态微调（CADFT），这是DFT的一个原则性扩展，用于控制样本级别的优化方差。CADFT从模型似然度中推导出一个动态的、依赖于策略的兼容性信号，以调节监督更新，抑制来自不兼容演示的高方差梯度。我们进一步提出了一种延迟的、低频的兼容性引导重写策略，将持续不兼容的演示转化为可学习的目标。我们表明，CADFT可以被解释为一个方差控制的估计器，将DFT中的令牌级稳定性推广到样本级别。大量实验表明，CADFT在保持完全监督且不依赖显式奖励建模的同时，提高了稳定性、泛化能力和冷启动强化学习初始化。

英文摘要

Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.
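
As a rough illustration of how a likelihood-derived compatibility signal could modulate supervised updates, the sketch below down-weights demonstrations to which the current policy assigns low likelihood. The weighting rule (geometric-mean token probability), the function names `compatibility_weight` and `weighted_sft_loss`, and the toy numbers are assumptions made for illustration; the abstract does not state CADFT's actual formula.

```python
import math

def compatibility_weight(token_logprobs):
    """Hypothetical compatibility signal: geometric-mean token probability
    of a demonstration under the current policy (a value in (0, 1])."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def weighted_sft_loss(batch_logprobs):
    """Average per-sample negative log-likelihood, with each sample scaled
    by its compatibility weight (treated as a constant, not differentiated)."""
    total = 0.0
    for token_logprobs in batch_logprobs:
        nll = -sum(token_logprobs) / len(token_logprobs)
        total += compatibility_weight(token_logprobs) * nll
    return total / len(batch_logprobs)

# Toy batch: one demonstration the policy finds likely, one it finds unlikely.
compatible = [math.log(0.9), math.log(0.8)]
incompatible = [math.log(0.05), math.log(0.02)]
print(compatibility_weight(compatible), compatibility_weight(incompatible))
print(weighted_sft_loss([compatible, incompatible]))
```

Under this assumed rule the unlikely demonstration still contributes to the loss, but its large per-token loss is scaled by a small weight, which is one simple way to damp high-variance updates.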

2606.11204 2026-06-11 cs.CL cs.IR 新提交

Benchmarking Large Language Models for Safety Data Extraction

大型语言模型在安全数据提取中的基准测试

Jonas Grill, Thomas Bayer, Sören Berlinger

发表机构 * SAP SE（SAP公司）； Institute for Digital Transformation, Ravensburg-Weingarten University（拉文斯堡-魏恩加滕大学数字化转型研究所）

AI总结针对安全数据表（SDS）的异构格式，本研究基准测试了四种大型语言模型（LLM）在文本与多模态处理下的提取性能，发现文本结合思维链提示的Gemini 1.5 Pro准确率最高（84%），但均未达到90%的可靠部署阈值。

Comments: 18 pages, 8 figures, submitted to Applied Intelligence

AI中文摘要

从安全数据表（SDS）中准确提取结构化信息在工业安全中仍具挑战性，原因在于文档格式异构以及传统基于规则的方法的局限性。本研究对最先进的大型语言模型（LLM）在自动化SDS数据提取方面进行了基准测试，比较了基于文本和多模态处理流水线。我们系统评估了四种模型：Gemini 1.5 Pro、GPT-4o、Claude 3.7 Sonnet和Llama 3.1-70B，采用三种提示策略：零样本、少样本和思维链。评估框架在超过50,000个提取数据字段上评估了准确性、延迟和成本。结果显示，基于文本的提取在所有指标上始终优于多模态处理。结合思维链提示的Gemini 1.5 Pro达到了最高准确率（84%），优于GPT-4o（81%）和Claude 3.7 Sonnet（79%）。然而，没有模型超过可靠实际部署通常所需的90%准确率阈值。这些发现表明，通用LLM在无监督工业使用中尚不够稳健，尽管性能表明通过任务特定微调具有强大潜力。未来研究应关注领域自适应训练、模型校准以及集成人在回路验证，以确保安全关键可靠性。

英文摘要

Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.

2606.11203 2026-06-11 cs.CL cs.LG 新提交

LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis

LatticeBridge: 用于忠实结构化序列合成的罕见事件序列推理

Faruk Alpay, Bugra Kilictas

发表机构 * Bahcesehir University（巴切塞希尔大学）

AI总结针对结构化序列生成中约束满足的罕见事件问题，提出LatticeBridge方法，结合前缀语言模型、实例编译表面自动机和扭曲序列蒙特卡洛解码器，在多个基准上显著提升锚点满足率和覆盖率。

Comments: 19 pages. Code and benchmark files available at this https URL

AI中文摘要

结构化序列生成通常要求模型在单个输出中满足多个输入派生约束。标准解码方法可能赋予流畅延续高概率，而对同时实现所有必需锚点的延续赋予低概率。我们将此机制视为罕见事件序列推理问题。LatticeBridge 结合了紧凑前缀语言模型、实例编译表面自动机以及带有重采样、多级分裂和源自实例提供短语的源支持提议项的扭曲序列蒙特卡洛 (SMC) 解码器。约束表示从每个输入实例编译而来，不依赖人工整理的词汇类别。在涵盖 CommonGen、E2E NLG 和 WikiBio 的 2,610 个可达到验证任务上，粒子解码器在共享提议模型下，相比贪心、波束过滤和 best-of-k 祖先基线，提高了精确锚点满足率和平均锚点覆盖率。由于仅精确锚点满足不能排除不支持的属性替换，评估同时报告了所需锚点覆盖率、源覆盖率、源入侵诊断、重叠度、运行时间和粒子统计量。该基准在固定提议模型下刻画了忠实度-重叠度-延迟前沿。

英文摘要

Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. Standard decoding methods may assign high probability to fluent continuations while placing low mass on continuations that realize all required anchors jointly. We study this regime as a rare-event sequential inference problem. LatticeBridge combines a compact prefix language model, instance-compiled surface automata, and a twisted sequential Monte Carlo (SMC) decoder with resampling, multilevel splitting, and a source-support proposal term derived from instance-provided phrases. The constraint representation is compiled from each input instance and does not rely on manually curated lexical classes. On 2,610 attainable validation tasks spanning CommonGen, E2E NLG, and WikiBio, the particle decoder improves exact anchor satisfaction and mean anchor coverage over greedy, beam-filtered, and best-of-k ancestral baselines under a shared proposal model. Since exact anchor satisfaction alone does not rule out unsupported attribute substitutions, the evaluation reports required-anchor coverage, source coverage, source-intrusion diagnostics, overlap, runtime, and particle statistics jointly. The benchmark characterizes the faithfulness-overlap-latency frontier under a fixed proposal model.
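
The resampling step that particle decoders of this kind rely on can be sketched generically. The code below is not LatticeBridge's twisted SMC decoder; it only shows two standard building blocks, the effective sample size and systematic resampling, with hypothetical function names and toy weights.

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2; equals the particle count for uniform weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def systematic_resample(weights, offset):
    """Return ancestor indices via systematic resampling.
    `offset` is a single uniform draw in [0, 1) shared by all strata."""
    n = len(weights)
    total = sum(weights)
    indices = []
    i = 0
    cumulative = weights[0] / total
    for k in range(n):
        position = (k + offset) / n
        # Advance to the particle whose cumulative weight covers this position.
        while position > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices

# One heavy particle (e.g. a prefix that already satisfies most anchors)
# is duplicated, while some light particles are dropped.
print(effective_sample_size([7, 1, 1, 1]))
print(systematic_resample([7, 1, 1, 1], offset=0.5))
```

A decoder would typically resample only when the effective sample size falls below some fraction of the particle count; that trigger threshold is a tuning choice not given in the abstract.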

2606.11202 2026-06-11 cs.CL 新提交

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

一次越狱，多种语言：学习语言无关的意图表示用于多语言越狱检测

Shuyu Jiang, Kaiyu Xu, Xingshu Chen, Hao Ren, Rui Tang, Yi Zhang, Tianwei Zhang, Hongwei Li

发表机构 * School of Cyber Science and Engineering, Sichuan University（四川大学网络空间安全学院）； School of Computer Science and Engineering, Nanyang Technological University（南洋理工大学计算机科学与工程学院）； School of Computer Science and Engineering, University of Electronic Science and Technology of China（电子科技大学计算机科学与工程学院）

AI总结针对多语言LLM安全漏洞，提出MLJailDe框架，通过多语言回译数据增强和相对距离约束，实现跨语言越狱检测，F1达98.5%。

AI中文摘要

大型语言模型（LLMs）越来越多地部署在面向全球多语言用户的应用程序中，然而安全训练仍集中在主流语言上，并未与多语言能力同步发展，从而为越狱攻击创造了可利用的漏洞。当前的越狱防御主要是在主流语言中开发和评估的，其有效性受到对齐的多语言监督稀缺以及语言变异导致的表示分散的限制。为了解决这个问题，我们提出了MLJailDe，一个多语言越狱检测框架，旨在提高多语言鲁棒性和跨语言泛化能力。MLJailDe首先引入了一种多语言回译数据增强算法，构建了一个语义一致且功能有效的数据集，涵盖11种语言，包含2,232个良性样本和1,239个越狱样本。在此基础上，MLJailDe采用相对距离约束来减少跨语言表示分散，并鼓励具有相似意图的越狱提示在不同语言中形成一致的聚类，同时进一步使用不平衡感知的分类目标来缓解类别不平衡并学习更可靠的多语言决策边界。实验结果表明，MLJailDe在多种语言上优于最先进的基线，F1分数达到98.5%，并且在未见过的语言上平均F1分数达到97.1%，展示了强大的有效性和跨语言泛化能力。

英文摘要

Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating exploitable gaps for jailbreak attacks. Current jailbreak defenses are largely developed and evaluated in dominant languages, and their effectiveness is limited by the scarcity of aligned multilingual supervision and by the representation dispersion caused by language variation. To address this issue, we propose MLJailDe, a multilingual jailbreak detection framework designed to improve both multilingual robustness and cross-lingual generalization. MLJailDe first introduces a multilingual back-translation data augmentation algorithm to construct a semantically consistent and functionally effective dataset spanning 11 languages, consisting of 2,232 benign and 1,239 jailbreak samples. On this basis, MLJailDe employs relative-distance constraints to reduce cross-lingual representation dispersion and encourage jailbreak prompts with similar intent to form consistent clusters across languages, while an imbalance-aware classification objective is further used to alleviate class imbalance and learn more reliable multilingual decision boundaries. Experimental results show that MLJailDe outperforms state-of-the-art baselines across multiple languages, achieving an F1 score of 98.5%, and obtains an average F1 score of 97.1% on unseen languages, demonstrating strong effectiveness and cross-lingual generalization.
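
One common way to express a relative-distance constraint is a triplet-style hinge loss, sketched below. This is an assumed stand-in, not MLJailDe's published objective: the function names, the Euclidean metric, and the margin value are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relative_distance_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the same-intent pair (anchor, positive)
    is closer than the different-intent pair (anchor, negative) by `margin`."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Toy 2-D embeddings: the same jailbreak intent in two languages sits close
# together, while a benign prompt sits far away.
jailbreak_en = [0.0, 0.0]
jailbreak_de = [0.1, 0.0]
benign_en = [3.0, 4.0]
print(relative_distance_loss(jailbreak_en, jailbreak_de, benign_en))
print(relative_distance_loss(jailbreak_en, benign_en, jailbreak_de))
```

Minimising such a loss pulls same-intent prompts from different languages together relative to other prompts, which matches the clustering behaviour the abstract describes.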

2606.11201 2026-06-11 cs.LG cs.AI cs.CL 新提交

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

干预还是不干预：通过概率模型混合指导推理时对齐

Jin Gan, Xin Li, Jun Luo

发表机构 * College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算机与数据科学学院）

AI总结提出BlendIn框架，通过质量感知对齐和按可靠性加权混合模型知识，解决推理时对齐中指导有效性差异大的问题，在困难模型对上实现最高50%的性能提升。

Comments: Accepted by ACL 2026

AI中文摘要

LLM的广泛部署使得模型对齐成为必要，以确保新训练的模型能够安全有效地响应用户指令。在不同方法中，推理时对齐通常更便宜，因为它仅在输出生成期间进行干预（即提供指导）。现有提案从某些对齐模型中提取指导，但没有适当评估其可靠性。然而，我们的系统评估显示，指导有效性在不同模型间差异很大；由于无效指导会导致进一步混乱和更多干预，由此产生的过度干预通常表明性能较差。为了使干预更有效且更高效，我们引入了BlendIn，一个推理时对齐框架，从二元决策转向创建整合两个模型知识的混合分布。BlendIn通过执行质量感知对齐并根据可靠性按比例加权每个模型的贡献来稳定推理时对齐。与现有工作相比，它保留了有益的指导，同时降低了不可靠建议的权重。BlendIn为未对齐的指导提供了诊断信号和缓解策略，在困难模型对上实现了一致且高达50%的性能提升。我们的代码可在以下网址获取：this https URL。

英文摘要

The wide deployment of LLMs has made model alignment necessary so that newly trained models respond safely and effectively to user instructions. Among different methods, inference-time alignment is often cheaper, as it intervenes (i.e., offers guidance) only during output generation. Existing proposals apply guidance extracted from certain aligned models without properly assessing its reliability. However, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidance leads to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing works, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent performance improvements of up to 50% on challenging model pairs. Our code is available at: this https URL.
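
The shift from a binary intervene-or-not decision to a hybrid distribution can be pictured as a convex mixture of two next-token distributions. The sketch below assumes a scalar `reliability` weight in [0, 1] and a hypothetical function name; how BlendIn actually estimates reliability is not described in the abstract.

```python
def blend_distributions(base_probs, guide_probs, reliability):
    """Mix two next-token distributions; `reliability` in [0, 1] is the
    weight given to the guiding (aligned) model."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return [(1.0 - reliability) * b + reliability * g
            for b, g in zip(base_probs, guide_probs)]

# Toy vocabulary of three tokens.
base = [0.7, 0.2, 0.1]
guide = [0.1, 0.1, 0.8]
blended = blend_distributions(base, guide, reliability=0.25)
print(blended)
```

Setting `reliability` to 0 or 1 recovers the two binary choices (ignore the guide or follow it fully), so the mixture contains hard switching as a special case.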

2606.11200 2026-06-11 cs.CL cs.CV 新提交

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

使用多模态语言模型检测社交媒体上的AI生成内容

Chenyang Yang, Shen Yan, Yibo Yang, Litao Hu, Yuchen Liu, Yuan Zeng, Hanchao Yu, Yinan Zhu, Sumedha Singla, Brian Vanover, Huijun Qian, Zihao Wang, Fujun Liu, Aashu Singh, Jianyu Wang, Xuewen Zhang

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Meta

AI总结针对AI生成内容检测的泛化性差、单模态依赖和缺乏可解释性问题，提出基于多模态数据的紧凑视觉-语言模型，实现检测与解释，在公开基准和内部数据集上达到最优性能。

AI中文摘要

生成式AI使得逼真的图像和视频得以创建，并越来越多地在社交媒体上传播，通常用于垃圾信息、错误信息、操纵和欺诈。现有的AI生成内容（AIGC）检测方法面临挑战，包括对新一代模型的泛化能力差、依赖单一模态以及缺乏可解释的解释。我们提出了一个流程，通过持续整理多样化的多模态社交媒体数据并训练一个紧凑的视觉-语言模型用于检测和解释，来缓解这些问题。我们的模型在公开基准上达到了最先进的检测性能，并在多个平台的内部社交媒体数据集上展示了强大的检测和解释能力。我们将模型部署在社交媒体平台上用于帖子推荐，并观察到对用户参与度的积极下游影响，表明在动态、真实的社交媒体环境中进行有效的AIGC检测是可行的。

英文摘要

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present a pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

2606.11198 2026-06-11 cs.CL cs.AI 新提交

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

结构注意力税：检索格式如何劫持上下文学习而与内容无关

Yuqi Zhang, Di Zhang

发表机构 * Xi’an Jiaotong-Liverpool University（西交利物浦大学）

AI总结研究发现知识图谱三元组因其格式结构比自然语言吸引2-3倍注意力，压缩演示注意力达42%，并提出了分解注意力为语义与结构成分的框架及缓解策略。

Comments: 10 pages, 5 figures

AI中文摘要

检索增强生成（RAG）系统注入外部知识以改进大语言模型输出，然而注入内容的格式——区别于其语义相关性——可以独立地扭曲模型的注意力分布。我们识别并形式化了一种称为结构注意力税的现象：知识图谱（KG）三元组，由于其关系分隔符和重复的槽位模式，每个token捕获的注意力是语义等价的自然语言文本的2-3倍（$\hat{o}$(KG) ≈ 0.70 对比 $\hat{o}$(中性) ≈ 0.25），将演示注意力压缩高达42%——无论三元组是相关还是噪声。我们开发了一个形式化框架，将注意力分数分解为语义和结构成分（公式2），推导了一个压缩界（命题1），将token级别的格式偏差与演示注意力损失联系起来，并表明结构项控制着注意力被转移多少，而语义项控制着这是有益还是有害。这种解耦揭示了改进检索增强ICL的两个正交轴：优化检索质量（语义轴）和减少格式驱动的注意力捕获（结构轴）。实验上，在两个模型家族（Mistral-7B, LLaMA-3-8B）和三个QA基准上，我们观察到源任务对齐占主导地位：任务匹配的BM25检索在HotpotQA上达到58-62%，而ConceptNet为25-27%，超过30个百分点的差距远远超过所有门控策略（≤2个百分点）。我们从该框架推导出五种结构感知缓解策略，从零成本提示修改到训练时正则化；格式展平（S3）通过来自口头化三元组控制的准确性和注意力级证据得到验证，而结构分散（S1）产生了混合结果，揭示了格式级别干预的挑战。

英文摘要

Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}(\mathrm{KG}) \approx 0.70$ vs. $\hat{o}(\mathrm{neutral}) \approx 0.25$), compressing demonstration attention by up to 42% -- regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet's 25-27%, a >30 pp gap that dwarfs all gating strategies ($\leq 2$ pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.

2606.11196 2026-06-11 cs.CL cs.AI cs.CR cs.LG 新提交

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

PoQ-Judge：去中心化LLM推理中成本感知的证明质量的多架构评估框架

Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan

发表机构 * DGrid AI

AI总结提出PoQ-Judge框架，训练专用裁判模型对查询-输出对进行无参考评分，研究三种架构，最佳模型在Pearson相关性上达到0.747，级联评估降低72.7%成本。

AI中文摘要

去中心化LLM推理网络需要轻量级、无参考的质量评估用于证明质量（PoQ）。我们提出PoQ-Judge，一个训练专用裁判模型对查询-输出对进行评分而无真实参考的框架。我们研究了三种架构在质量-成本权衡中的表现：TextCNN裁判、MiniLM交叉编码器和DeBERTa裁判。通过在UltraFeedback和GPT标记的领域内数据上进行两阶段训练，最佳模型在保留测试集上与真实代理的Pearson相关性达到0.747，优于先前工作中基于参考的评估器。作为复合评分中的无参考组件，它实现了0.645的Pearson相关性，匹配最佳单一基于参考的评估器，同时消除了对参考答案的需求。我们还表明，在线校准将语义质量识别为主导维度，级联评估将成本降低72.7%，仅带来适度的质量损失。结果在问答任务上比摘要任务强得多，表明代理质量是主要剩余限制。

英文摘要

Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
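
Cascade evaluation can be sketched as a cheap judge that escalates to a stronger judge only when its own score is uncertain. The uncertainty band, the cost units, and the function names below are hypothetical; the abstract reports the 72.7 percent cost reduction but not the escalation rule behind it.

```python
def cascade_score(item, cheap_judge, strong_judge, low=0.3, high=0.7):
    """Return (score, cost). The cheap judge decides alone unless its score
    falls in the uncertain band [low, high], in which case the strong judge
    is also called. Costs are arbitrary units (cheap = 1, strong = 10)."""
    score = cheap_judge(item)
    if low <= score <= high:
        return strong_judge(item), 1 + 10
    return score, 1

# Toy judges that read precomputed scores from the item itself.
cheap = lambda item: item["cheap"]
strong = lambda item: item["strong"]

confident_item = {"cheap": 0.9, "strong": 0.5}
uncertain_item = {"cheap": 0.5, "strong": 0.2}
print(cascade_score(confident_item, cheap, strong))
print(cascade_score(uncertain_item, cheap, strong))
```

The saving comes from the share of items the cheap judge settles on its own, so the band width trades evaluation cost against agreement with the stronger judge.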

2606.11192 2026-06-11 cs.LG math.OC 新提交

Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation

具有不完美二元反馈的 restless bandits: PCL-indexability 分析与计算

José Niño-Mora

发表机构 * Universidad Carlos III de Madrid（马德里卡洛斯三世大学）

AI总结针对具有二元隐状态和不完美二元反馈的 restless bandits，提出基于部分守恒律（PCL）的分析与计算框架，通过验证定理、确定性骨架和组合词方法建立可索引性并计算 Whittle 指数，实验表明 MP 指数策略优于基准策略。

Comments: 59 pages, 12 figures, submitted 27/3/2026

AI中文摘要

我们研究具有二元隐状态和不完美二元反馈的 restless bandits，受具有感知错误的机会频谱接入启发。对于相关的信念状态模型，我们开发了一个基于部分守恒律（PCL）的分析与计算框架，用于建立可索引性和评估 Whittle 指数，该框架建立在实状态折扣 restless bandits 的验证定理之上。该框架通过相关的确定性骨架、更新分解和组合词分析随机动力学。它在几个阈值区域中为折扣奖励和资源度量提供了易处理的表达式，从而能够在那里完全验证 PCL 可索引性条件。对于本文中未实现完整分析验证的剩余区域，我们推导了用于计算相关边际度量和边际生产率（MP）指数的有效数值方案，当这些条件成立时，MP 指数等于 Whittle 指数。广泛的计算实验提供了强有力的证据，表明这些条件也在该区域中成立，跨越广泛的参数范围，且没有先前工作中施加的严格参数限制。实验进一步表明，MP 指数策略通常优于标准基准策略，且往往有显著优势。

英文摘要

We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors. For the associated belief-state model, we develop a partial conservation laws (PCL)-based analytical and computational framework for establishing indexability and evaluating the Whittle index, building on a verification theorem for real-state discounted restless bandits. The framework analyzes the stochastic dynamics via an associated deterministic skeleton, renewal decompositions, and combinatorics on words. It yields tractable expressions for discounted reward and resource metrics in several threshold regimes, enabling full verification of the PCL-indexability conditions there. For the remaining regime, where a complete analytic verification is not achieved in this paper, we derive efficient numerical schemes for computing the relevant marginal metrics and the marginal productivity (MP) index, which equals the Whittle index when those conditions hold. Extensive computational experiments provide strong evidence that these conditions also hold in that regime across broad parameter ranges and without the stringent parameter restrictions imposed in prior work. The experiments further show that the MP index policy typically outperforms standard benchmark policies, often by a substantial margin.
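
The belief-state model underlying such a bandit can be illustrated with a textbook Bayes filter for a two-state Markov chain observed through a noisy binary sensor. The parameter names (`p01`, `p11`, `hit`, `false_alarm`) and the correct-then-predict ordering are assumptions for this sketch; the paper's exact model and its index computation are not reproduced here.

```python
def predict(belief, p01, p11):
    """One-step Markov prediction of P(state = 1).
    p01 = P(next = 1 | current = 0), p11 = P(next = 1 | current = 1)."""
    return belief * p11 + (1.0 - belief) * p01

def update_belief(belief, observation, p01, p11, hit, false_alarm):
    """Bayes-correct the belief with a noisy binary observation, then predict.
    hit = P(obs = 1 | state = 1), false_alarm = P(obs = 1 | state = 0)."""
    like1 = hit if observation == 1 else 1.0 - hit
    like0 = false_alarm if observation == 1 else 1.0 - false_alarm
    posterior = belief * like1 / (belief * like1 + (1.0 - belief) * like0)
    return predict(posterior, p01, p11)

# Active arm: a noisy observation sharpens the belief before the chain moves.
print(update_belief(0.5, 1, p01=0.2, p11=0.8, hit=0.9, false_alarm=0.1))
# Passive arm: no observation, the belief simply drifts under the chain.
print(predict(0.5, p01=0.2, p11=0.8))
```

Because the belief takes values in a continuum, the resulting bandit has a real-valued state, which is the setting the abstract's verification theorem addresses.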

2606.10376 2026-06-11 cs.AI cs.IT 交叉投稿

Belief-Space Control for Personalized Cancer Treatment via Active Inference

基于主动推理的个性化癌症治疗信念空间控制

Deniz Sargun, H. Bugra Tulay, C. Emre Koksal

发表机构 * American Association for Cancer Research（美国癌症研究协会）； AACR Project GENIE registry（AACR Project GENIE 注册中心）； AACR Project GENIE Biopharma Collaborative（AACR Project GENIE 生物制药合作组织）

AI总结提出用主动推理将癌症治疗建模为信念空间规划问题，在测量预算下统一目标导向控制与信息获取，实现患者分类与高效治疗。

Comments: 11 pages including appendix

AI中文摘要

癌症治疗本质上是一个具有部分可观测性、潜在患者异质性以及医疗测量预算明确约束的序贯决策问题。与标准强化学习（RL）方法控制状态轨迹不同，癌症治疗会永久性地改变患者的转移动力学，从而改变状态随时间演化的方式。我们使用主动推理将癌症治疗建模为信念空间规划问题，推导出一个期望自由能目标，该目标在测量预算下统一了目标导向控制和信息获取。我们使用来自AACR Project GENIE Biopharma Collaborative数据集的真实临床癌症数据实现了该框架。临床数据结果表明，在真实的测量和治疗约束下，能够同时实现患者分类和高治疗效力。

英文摘要

Cancer treatment is at its core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit constraints on the budget for medical measurements. Unlike standard Reinforcement Learning (RL) approaches that control state trajectories, cancer treatments permanently modify patients' transition dynamics, changing how states evolve over time. We model cancer treatment as a belief-space planning problem using active inference, deriving an expected free-energy objective that unifies goal-directed control and information acquisition under measurement budgets. We implement this framework using real clinical cancer data from the AACR Project GENIE Biopharma Collaborative dataset. Results on clinical data demonstrate simultaneous patient categorization and high treatment efficacy under real measurement and treatment constraints.

2606.11152 2026-06-11 cs.CV 版本更新

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

P3D-Bench：用于参数化3D生成与结构推理的多模态大语言模型基准

Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou, Jingxi Xu, Feihu Zhang, Jiaheng Liu, Yao Yao

发表机构 * Nanjing University（南京大学）； Envision

AI总结提出P3D-Bench基准，通过参数化3D程序评估多模态大语言模型在几何精度、语义对齐和装配一致性上的表现，涵盖文本到3D、图像到3D和装配3D三类任务。

Comments: Project page: this https URL

AI中文摘要

多模态大语言模型能够编写代码生成复杂程序，并利用程序进行3D建模，这为基于其先验知识、世界知识和推理能力的3D生成开辟了新途径。然而，现有基准很少通过代码评估3D建模。这种建模不仅需要可运行代码：从文本或视觉规范出发，模型必须生成几何精确、语义对齐且装配一致的参数化3D程序。我们引入P3D-Bench，一个用于参数化3D生成的基准。与3D网格不同，参数化3D程序暴露了显式尺寸、构造操作和零件关系，揭示了模型是否恢复设计结构而不仅仅是外观。在统一协议下，P3D-Bench涵盖三个任务族（文本到3D、图像到3D和装配3D），并对每个输出进行可执行性、几何保真度、拓扑、文本约束、多视图语义对齐和零件级结构的评分。我们在400个文本案例、400个图像案例和203个带注释的装配体上评估了前沿多模态大语言模型和纯文本大语言模型，并以领域特定模型作为参考点。我们的广泛评估得出三个发现。首先，装配是最困难的设置，模型仍然无法将多个零件组合成连贯结构。其次，模型通常能恢复目标对象的整体形状和语义身份，但无法再现输入指定的精确参数化几何。第三，零件级建模在装配上仍然薄弱，模型既不能恢复每个零件的几何形状，也不能恢复正确的零件数量。这些结果使P3D-Bench成为评估参数化3D生成中精确参数化几何和零件级结构的基准。

英文摘要

Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our extensive evaluation yields three findings. First, assemblies are the hardest setting, where models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.

2606.11074 2026-06-11 cs.CL cs.AI 版本更新

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

建模复杂行为：视觉语言模型中的多人格组合与动态切换

Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du, Yuntao Wang, Zhou Su

发表机构 * Xi'an Jiaotong University（西安交通大学）； Beihang University（北京航空航天大学）

AI总结本研究在视觉语言模型中引入显式人格条件，建立包括单人格、多人格和人格切换的系统评估框架，发现人格提示可提升图像描述但损害精确推理任务，并观察到多特质组合与动态切换中的平衡与残留效应。

Comments: 16 pages, 4 figures, 10 tables

AI中文摘要

随着多模态大语言模型（MLLMs）在社交互动中的广泛部署，理解和控制其在复杂人格条件下的行为至关重要。本文引入显式人格条件，并建立了一个系统的评估框架，涵盖单人格诱导、多人格诱导和人格切换。实验表明，人格诱导能提升图像描述性能，但会损害需要精确推理的任务（如视觉问答）的性能。在多特质组合和动态切换过程中观察到平衡和残留效应，表明模型行为受到先前和当前人格约束的共同调节。现有的基于提示的人格诱导方法在多模态设置中表现出有限的迁移性。我们的工作揭示了MLLMs中人格建模的动态和复杂性质，并强调了针对人格诱导和评估的鲁棒、定制化方法的必要性。代码将在论文被接收后发布。

英文摘要

With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

2606.10968 2026-06-11 cs.LG cs.AI 版本更新

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

超越大语言模型强化学习中的统一令牌级信任区域

Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu

发表机构 * Tencent Hunyuan（腾讯混元）

AI总结针对PPO风格信任区域在自回归生成中的位置无关问题，提出CPPO方法，通过位置加权阈值和累积前缀预算动态调整令牌级约束，提升训练稳定性和推理准确性。

Comments: Project Page: this https URL

AI中文摘要

具有可验证奖励的强化学习（RLVR）已成为提升大语言模型推理能力的标准方法。然而，现有的PPO风格信任区域机制通过在所有令牌上独立施加统一阈值，仍然是位置无关的。这种逐点处理方式在两个方面与自回归生成相冲突。首先，统一阈值忽略了自回归不对称性。早期阶段的偏差会产生累积的序列级漂移，导致静态阈值对早期发散约束不足，而对后期探索过度约束。其次，孤立地评估令牌级发散忽略了累积前缀漂移，无论条件历史已经偏离滚动策略多远，都给予相同的发散允许量。为解决这一局限性，我们提出了CPPO（累积前缀散度策略优化），这是一种令牌级掩码规则，通过两种耦合机制将更新与有限时域策略改进界对齐。首先，位置加权阈值对早期位置施加更严格的限制，因为这些位置的影响持续时间更长，同时放宽对后期令牌的约束。其次，累积前缀预算跟踪历史偏差，动态限制进一步的令牌级偏差，以防止沿前缀的复合错误。实验表明，CPPO在不同模型规模上增强了训练稳定性并显著提高了推理准确性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
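
The two coupled mechanisms can be pictured with a small masking routine. Everything concrete below is an assumption made for illustration: the linear threshold schedule, the `eps_min`, `eps_max` and `budget` values, and the use of a generic non-negative per-token divergence. The abstract does not give CPPO's actual rule.

```python
def cppo_style_mask(divergences, eps_min=0.1, eps_max=0.3, budget=0.5):
    """Return a keep-mask (True = token contributes to the policy update).

    divergences: non-negative per-token divergences between the current
    policy and the rollout policy, in generation order."""
    n = len(divergences)
    mask = []
    prefix_drift = 0.0
    for t, d in enumerate(divergences):
        # Position-weighted threshold: strict for early tokens, looser later.
        frac = t / (n - 1) if n > 1 else 1.0
        threshold = eps_min + (eps_max - eps_min) * frac
        # Cumulative prefix budget: less room once the prefix has drifted.
        allowed = min(threshold, max(0.0, budget - prefix_drift))
        mask.append(d <= allowed)
        prefix_drift += d
    return mask

# The same divergence of 0.2 is rejected at the first position (threshold
# too strict), accepted mid-sequence, and rejected at the end (budget spent).
print(cppo_style_mask([0.2, 0.05, 0.2, 0.2]))
```

A uniform threshold would treat the three tokens with divergence 0.2 identically; here the outcome depends on both position and accumulated drift, which is the asymmetry the abstract argues for.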

2606.10820 2026-06-11 cs.LG cs.AI cs.CL 版本更新

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

K-Forcing：通过前推语言建模进行联合下一K词解码

Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang

发表机构 * DAMO Academy, Alibaba Group（阿里巴巴达摩院）； Hupan Lab（湖畔实验室）； Zhejiang University（浙江大学）； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出K-Forcing范式，通过前推映射将自回归模型蒸馏为单次前向传播生成多个未来词，实现2.4-3.5倍加速，质量损失小。

Comments: Code: this https URL

AI中文摘要

自回归语言建模是文本生成的主导范式，但其逐词顺序解码使得推理受限于内存且效率低下。现有的加速方法（如推测解码和扩散语言模型）在特定条件下可提升速度，但并未直接解决高负载批量服务——这一对工业级部署最为关键的场景。我们提出K-Forcing，一种用于联合下一k词解码的前推语言建模范式。K-Forcing将现有自回归模型蒸馏为条件前推映射——该映射在单次前向传播中将独立均匀噪声变量转换为多个未来词的联合样本。该设计保留了固定长度输出，复用了自回归教师模型的主干，并与标准自回归服务基础设施兼容。我们通过渐进式自强迫蒸馏训练该映射，逐步扩展预测窗口，同时使学生模型紧密匹配自回归教师模型的序列分布。我们在LM1B和OpenWebText上使用标准因果Transformer主干评估K-Forcing。当激进配置为每次前向传播生成k=4个词时，K-Forcing在不同批量大小下实现约2.4-3.5倍加速，同时相对于自回归教师模型仅带来轻微的质量下降。随着推理在现代LLM的生命周期计算成本中占据主导地位，K-Forcing为在现实高负载部署下加速自回归生成提供了一条有前景的途径。

英文摘要

Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.
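
The idea of a push-forward mapping from uniform noise to tokens can be seen on the teacher side with inverse-CDF sampling: once the noise values are fixed, the sampled tokens are a deterministic function of them. The sketch below uses a hypothetical three-token `toy_teacher` and samples one position at a time; K-Forcing's student, which produces all k tokens in a single forward pass, is not reproduced here.

```python
def inverse_cdf_sample(probs, u):
    """Map a uniform draw u in [0, 1) to a token index."""
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against rounding when u is close to 1

def push_forward(next_token_probs, prefix, noise):
    """Deterministically map k uniform noise values to k tokens by sampling
    the teacher one position at a time."""
    tokens = list(prefix)
    for u in noise:
        tokens.append(inverse_cdf_sample(next_token_probs(tokens), u))
    return tokens[len(prefix):]

def toy_teacher(tokens):
    """Hypothetical teacher: next-token probabilities depend on the last token."""
    table = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
    return table[tokens[-1]]

print(push_forward(toy_teacher, prefix=[0], noise=[0.6, 0.5, 0.95, 0.1]))
```

Drawing the noise independently and uniformly makes the output an exact joint sample from the teacher, so a student that learns the same noise-to-tokens map in one pass would match the teacher's sequence distribution.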

2606.10804 2026-06-11 cs.CV 版本更新

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

SCAIL-2：通过端到端上下文条件统一受控角色动画

Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

Affiliations: Z.ai; Tsinghua University

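AI summary: Proposes the SCAIL-2 framework, which unifies controlled character animation through end-to-end in-context conditioning. It bypasses intermediate representations by using the driving video directly, synthesizes the MotionPair-60K dataset, achieves unification with in-context mask conditioning and mode-specific RoPE, and adds Bias-Aware DPO to reduce errors; it substantially outperforms existing methods.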

AI Chinese abstract

受控角色动画需要将运动从驱动序列转移到参考角色。先前的工作严重依赖中间表示，包括用于表示运动的姿态骨架或用于表示环境的掩码背景，这不可避免地导致信息损失。为了解决这个问题，我们提出了SCAIL-2，一个绕过这些中间表示并实现端到端角色动画的框架。通过将驱动视频直接连接到序列，模型可以从输入视频中获得所有所需的视觉信息。为了解决缺乏端到端数据的问题，我们通过解耦条件统一角色动画的子任务，然后策划一个流程来合成MotionPair-60K，一个包含角色动画异构任务的端到端运动转移数据集。为了实现统一，我们利用上下文掩码条件和模式特定的RoPE作为文本指令和原始视觉信息之外的软引导。为了解决详细区域的合成差异，我们提出了Bias-Aware DPO来构建偏好项目以减轻误差。大量实验表明，我们的方法在各种角色动画任务中显著优于现有的最先进方法。合成数据的一个大子集以及模型权重将在我们的项目页面发布：this https URL。

English abstract

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: this https URL.

2606.10794 2026-06-11 cs.AI updated version

READER: Robust Evidence-based Authorship Decoding via Extracted Representations

READER: 基于提取表示的鲁棒证据作者身份解码

Jiaxu Liu, Sunnan Mu, Dong Huang, Liuyin Wang, Jing Shao, Jie Zhang

Affiliations: National University of Singapore; Xidian University; Tsinghua University; Shanghai Artificial Intelligence Laboratory

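AI summary: Addresses black-box LLM provenance identification with the READER framework, which uses a frozen proxy LLM to read hidden authorship evidence and applies Bayesian evidence accumulation for multi-query attribution; it substantially outperforms baseline methods on the Agent500 dataset.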

AI Chinese abstract

随着智能体应用越来越多地通过官方和第三方LLM API路由用户任务，来源成为一个操作性问题：哪个模型生成了给定的黑盒响应？我们研究动态黑盒LLM来源识别：从由查询变化、非预定义提示（而非固定输入集或基准套件）引发的生成中识别源LLM。这种设置很困难，因为提示语义主导文本，而模型特定的作者痕迹在表面层面是微弱且不一致的。我们引入READER（基于提取表示的鲁棒证据作者身份解码），一种轻量级来源框架，将冻结的代理LLM视为隐藏作者证据的读取器。READER将黑盒输出映射到代理激活空间，在时间上过滤每个响应中的令牌状态，并通过跨独立采样提示求和单响应对数后验证据来执行贝叶斯证据累积。这避免了提示特定表示的脆弱平均池化，同时保留了校准置信度所需的查询级证据。在Agent500（一个基于智能体风格提示构建的50目标数据集）上，READER从单个响应达到31.0%-42.4%的top-1准确率，从50个响应达到70.0%-84.0%的准确率，显著优于句子编码器指纹。跨九个代理读取器的扩展进一步表明，更强的LLM暴露更多线性可解码的作者身份结构，表明作者身份感知已经存在于冻结的LLM表示中，并且可以转化为可靠的多查询归因。

English abstract

As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches 31.0-42.4% top-1 accuracy from a single response and 70.0-84.0% from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.
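
The evidence-accumulation step described in the abstract (summing single-response log-posterior evidence across independently sampled prompts) can be sketched as follows. This is a minimal illustration, not the READER implementation: the per-response probabilities are made-up numbers standing in for the output of some classifier over candidate source models, and the function name `accumulate_evidence` and the clamping constant are assumptions.

```python
import math


def accumulate_evidence(per_response_probs):
    """Sum per-response log-posteriors over candidate models and renormalise.

    per_response_probs: one probability vector per response, each indexed by
    candidate source model (hypothetical classifier outputs).
    """
    n_models = len(per_response_probs[0])
    log_scores = [0.0] * n_models
    for probs in per_response_probs:
        for m, p in enumerate(probs):
            # Clamp to avoid log(0) when a classifier outputs an exact zero.
            log_scores[m] += math.log(max(p, 1e-12))
    # Softmax over the summed log scores, shifted by the max for stability.
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three weakly informative responses over three candidate models.
    responses = [
        [0.40, 0.35, 0.25],
        [0.45, 0.30, 0.25],
        [0.30, 0.40, 0.30],
    ]
    print(accumulate_evidence(responses))
```

Each response is only weakly informative on its own, but the sum of log scores separates the candidates more clearly as responses accumulate. This is consistent with the reported gap between single-response and 50-response accuracy.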

2606.10775 2026-06-11 cs.CV updated version

Spatially Selective Self-Training for Unsupervised Building Change Detection

空间选择性自训练用于无监督建筑变化检测

Wafaa I. M. Hussin, Zhi Lu, Anas M. I. Mohammed, Xiang Zhou, Ratiba A. H. Abubaker, Zhenming Peng

Affiliations: School of Information and Communication Engineering, University of Electronic Science and Technology of China; Chengdu Yaguang Electronic Co., Ltd.; Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China; School of Civil Engineering, University of Khartoum; National Energy Research Center, Ministry of Higher Education and Scientific Research

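AI summary: Proposes the SST-CD framework, which uses spatially selective self-training and a local consistency criterion to learn a building change detector from unlabeled bi-temporal remote sensing images; it outperforms existing unsupervised methods on three datasets.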

Comments: Under Review

AI Chinese abstract

无监督建筑变化检测旨在从未标记的双时相遥感图像中学习建筑变化掩膜。现有的无标签方法通常遵循差异到掩膜范式，直接使用时相差异、冻结的基础模型响应、基于提示的输出或后处理结果作为最终变化图。尽管这些策略提供了无标注线索，但它们并未学习任务特定的建筑变化检测器，并且仍然容易受到通用时相差异与建筑定义的结构变化之间的差距的影响。在实践中，这种差异通常是嘈杂且与任务无关的，因为外观变化、配准误差和非建筑修改可能产生强烈但误导性的响应。为了解决这个问题，我们提出了SST-CD，一种空间选择性自训练框架，将完全无标签的建筑变化检测重新表述为在嘈杂伪监督下的端到端检测器学习。SST-CD使用时相差异作为候选伪标签，并仅在空间可靠像素上训练检测器，其可靠性通过局部一致性准则估计，该准则从监督中过滤不一致区域。为了进一步稳定嘈杂的自训练，一个轻量级特征适配器重新校准双时相特征，而基于原型的解码器产生紧凑的变化和无变化表示。在LEVIR-CD、WHU-CD和DSIFN-CD上的实验表明，SST-CD分别达到了83.08%、91.69%和86.60%的F1分数，优于现有的无监督和无标签基线。代码将公开提供。

English abstract

Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines.
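
The abstract does not spell out the local consistency criterion, so the sketch below assumes one simple possibility: a pixel's pseudo label counts as reliable when a large enough share of its neighbours in a small window carry the same label. The function name `reliability_mask`, the window `radius`, and the `min_agreement` threshold are all hypothetical choices for illustration, not the SST-CD definition.

```python
def reliability_mask(pseudo, radius=1, min_agreement=0.75):
    """Mark a pixel reliable when enough neighbours share its pseudo label.

    pseudo: 2D list of 0/1 candidate pseudo labels (e.g. thresholded
    temporal discrepancies). Returns a 2D list of booleans.
    """
    h, w = len(pseudo), len(pseudo[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            agree = total = 0
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    if (ny, nx) == (y, x):
                        continue  # skip the centre pixel itself
                    total += 1
                    agree += pseudo[ny][nx] == pseudo[y][x]
            mask[y][x] = total > 0 and agree / total >= min_agreement
    return mask


if __name__ == "__main__":
    # One isolated "change" pixel in an otherwise unchanged patch.
    pseudo = [
        [0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    for row in reliability_mask(pseudo):
        print(row)
```

In a self-training loop, a mask like this would restrict the loss to the reliable pixels, so that isolated responses such as the single change pixel above do not supervise the detector.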
