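arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.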

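2606.09892 2026-06-10 cs.LG stat.ME New submission

LMT: A Bayesian Framework for Causal Discovery from Textual Alarm Records in Manufacturing Systems

Xiaofeng Xiao, Jianhong Chen, Qiuzhuang Sun, Naichen Shi, Xubo Yue

AI summary: Proposes the LMT framework, which combines semantic signals extracted by large language models with Poisson-process-based temporal evidence to discover causal graphs from textual alarm records via Bayesian inference; it performs well in small-sample settings.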

Comments 19 pages

Abstract

Textual event records, such as alarm logs, have become an increasingly common data source in engineering and manufacturing systems. Beyond identifying correlations or recurring patterns, engineers are often interested in understanding which types of events causally trigger or influence other events during system operation. Textual event descriptions may contain semantic clues about such causal relationships, and recent large language models (LLMs) provide a promising tool for extracting these signals. However, relying solely on LLM-encoded textual information is insufficient for accurate causal discovery, since semantic patterns do not directly reveal causal mechanisms and may confuse causation with correlation or frequent sequential patterns. To address these challenges, we propose LMT, a Bayesian causal discovery framework for engineering event data that jointly leverages textual descriptions and timestamps. Specifically, LMT first uses LLMs to extract semantic causal signals from event descriptions and constructs a prior distribution over causal graphs among event types or event clusters. It then incorporates temporal evidence through a Poisson-process-based likelihood, allowing the LLM-informed prior to be refined by timestamp-based statistical evidence. By integrating the textual and temporal information, LMT produces a causal graph that is both interpretable and data-supported. Simulation studies show that the proposed framework is effective across different settings and is especially advantageous in small-sample alarm-event scenarios.
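
The prior-plus-likelihood idea in this abstract can be illustrated with a toy single-edge calculation. This is only a sketch under stated assumptions, not the paper's model: the abstract does not give the likelihood, so the sketch assumes homogeneous Poisson rates with a conjugate Gamma(a, b) prior and compares "one shared rate" against "two separate rates" for type-j events inside versus outside the windows that follow type-i events. The function names, the windowed counts, and the hyperparameters are all hypothetical.

```python
import math


def log_marginal(n: int, t: float, a: float, b: float) -> float:
    """Log marginal likelihood of n Poisson events over exposure t under a
    Gamma(a, b) prior on the rate. The t**n / n! factor is dropped because it
    cancels in the Bayes factor computed below."""
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + n) - (a + n) * math.log(b + t))


def posterior_edge_probability(prior: float, n_after: int, t_after: float,
                               n_base: int, t_base: float,
                               a: float = 1.0, b: float = 1.0) -> float:
    """Update an LLM-derived prior edge probability with timestamp evidence.

    n_after / t_after: count of j events and total time inside windows after i.
    n_base / t_base: count of j events and total time outside those windows.
    """
    # M1 (edge): the two periods have independent rates.
    log_m1 = log_marginal(n_after, t_after, a, b) + log_marginal(n_base, t_base, a, b)
    # M0 (no edge): a single rate governs both periods.
    log_m0 = log_marginal(n_after + n_base, t_after + t_base, a, b)
    log_odds = math.log(prior / (1.0 - prior)) + (log_m1 - log_m0)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Rate after i is far above baseline: the prior of 0.3 is pushed toward 1.
strong = posterior_edge_probability(0.3, n_after=30, t_after=10.0, n_base=20, t_base=100.0)
# Rate after i matches baseline: the prior of 0.3 is pulled down.
weak = posterior_edge_probability(0.3, n_after=2, t_after=10.0, n_base=20, t_base=100.0)
```

Note that this toy alternative hypothesis is two-sided: a rate that drops after type-i events would also count as evidence for an edge. A triggering-only model would need a one-sided alternative.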

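2606.09891 2026-06-10 cs.LG cs.AI cs.IR New submission

Representation Curriculum: Stagewise Training for Robust Ranking and Allocation

Ehsan Ebrahimzadeh, Sina Baharlouei, Abraham Bagherjeiran

AI summary: Proposes Representation Curriculum (RC), which introduces features in stages, emphasizing content-based signals first and exposure-dependent signals later, to reduce shortcut reliance on historical signals and improve cold-start generalization and robustness.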

Comments 12 pages, 5 figures

DOI: 10.1145/3770855.3818470

Abstract

Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g. popularity estimates, CTR/CVR aggregates, and ID-based representation), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.

2606.09890 2026-06-10 cs.LG cs.AI cs.CL New submission

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

Hainiu Xu, Italo Luis da Silva, Jiangnan Ye, Yuhao Wang, Wei Liu, Linyi Yang, Jonathan Richard Schwarz, Nicola Paoletti, Yulan He, Hanqi Yan

AI summary: Introduces the predictive monitoring task, which judges before an action is executed whether it will lead to unethical behavior, and builds the PreActBench benchmark; evaluations of a range of models show that the task is challenging.

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

2606.09889 2026-06-10 cs.LG New submission

Optuna Constrained Tree-Structured Parzen Estimator Is a Joint Density Generalization of c-TPE

Shuhei Watanabe, Kaichi Irie

AI summary: Shows that Optuna's constrained TPE is a joint c-TPE whose ECI acquisition function uses a joint likelihood, and demonstrates that it is more robust to duplicated constraints than the independent c-TPE.

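2606.09888 2026-06-10 cs.LG New submission

SinkRec: Mitigating Semantic State Sink in Long Sequence Recommendation with Memory-Conditioned Gated Delta Networks

Zhuang Zhuang, Zhipeng Wei, Ji Dai, Jie Chen, Fei Pan, Peng Jiang, Kun Gai

AI summary: Targets semantic state sink, which repetitive behavior patterns cause in linear attention for long-sequence recommendation. Proposes SinkRec, a hybrid memory-recurrent architecture that externalizes recurring patterns via residual vector quantization and uses a TDGD module to purify reads and writes, improving recommendation quality while keeping linear-time efficiency.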

Abstract

Linear attention provides an efficient backbone for long-sequence recommendation by avoiding the quadratic cost of standard Transformers, but its compressed recurrent state can be dominated by repetitive behavior patterns. We identify this phenomenon as semantic state sink, where recurring semantics over-occupy the recurrent state and bias subsequent readouts. To mitigate semantic state sink, we propose SinkRec, a hybrid memory-transition looped architecture that decouples collaborative behavioral pattern storage from dynamic transition modeling. SinkRec externalizes recurring local patterns into a learnable conditional memory through residual vector quantization, reinjects the retrieved codes, and exposes memory key-value pairs to the attention block. It further introduces Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD), which uses memory to purify recurrent writing and reading by suppressing memory-covered updates and removing memory-aligned readout responses. This design turns recurring semantics from state-competing signals into memory-retrievable patterns, allowing the recurrent state to focus on dynamic transitions and alleviating semantic state sink with linear-time efficiency. Experiments on public and industrial datasets demonstrate the effectiveness and efficiency of SinkRec.

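2606.09887 2026-06-10 cs.LG cs.AI cs.CL New submission

SocraticPO: Policy Optimization via Interactive Guidance

Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

AI summary: Proposes the SocraticPO framework, which uses natural-language guidance to assist reasoning during reinforcement learning and applies reward decay to keep the model from relying on teacher help, improving performance on scientific reasoning tasks.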

Abstract

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
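
The rollout described in this abstract (answer, get a hint on failure, retry, receive a decayed reward) might be sketched as follows. The abstract does not specify the decay schedule, so the geometric form `decay ** hints_used`, the `max_hints` cap, and the `student`, `teacher`, and `is_correct` callables are all assumptions made for illustration.

```python
from typing import Callable, Tuple


def socratic_rollout(question: str,
                     student: Callable[[str], str],
                     teacher: Callable[[str, str], str],
                     is_correct: Callable[[str], bool],
                     max_hints: int = 2,
                     decay: float = 0.5) -> Tuple[str, float]:
    """Run one guided rollout and return (final_answer, reward)."""
    context = question
    answer = ""
    for hints_used in range(max_hints + 1):
        answer = student(context)
        if is_correct(answer):
            # An unassisted success earns 1.0; each hint used so far
            # multiplies the reward by `decay` (assumed schedule).
            return answer, decay ** hints_used
        if hints_used < max_hints:
            # The teacher diagnoses the failed attempt and the hint is
            # appended to the student's context for the next attempt.
            context = context + "\n" + teacher(context, answer)
    return answer, 0.0


# Stub models: this student only succeeds once a hint is in its context.
def stub_student(ctx: str) -> str:
    return "4" if "hint:" in ctx else "5"


def stub_teacher(ctx: str, attempt: str) -> str:
    return "hint: add the two numbers"


answer, reward = socratic_rollout("What is 2 + 2?", stub_student, stub_teacher,
                                  lambda a: a == "4")
```

Because only the rollout and the scalar reward change, the returned reward could be passed to any policy-gradient update unchanged, which matches the plug-in property the abstract claims.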

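2606.09886 2026-06-10 cs.LG cs.AI New submission

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

Yuhao Zhang

AI summary: Proposes the SHAPE framework, which models cooperation among experts with cooperative game theory, uses Shapley values to identify high-contribution experts, and applies a quality-coverage selection rule to retain key experts under a pruning budget; experiments on several MoE models show improved robustness and reduced GPU memory use.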

Abstract

Sparse Mixture-of-Experts (MoE) large language models achieve strong quality with low per-token compute, yet their deployment is often limited by the memory wall: the full expert pool must remain resident to support token-dependent routing. Expert pruning is a direct remedy, but prior criteria typically score experts independently and overlook that MoE inference is inherently coalitional, where outputs arise from routed top-$k$ expert combinations. We propose SHAPE, a task-driven pruning framework that explicitly models intra-layer expert cooperation. SHAPE formulates routing traces on a small calibration set as an empirical cooperative game and assigns interaction-aware expert values via a Shapley-style attribution over observed top-$k$ coalitions, enabling the identification of experts that are essential for high-utility collaborations rather than merely frequent. To preserve MoE topology under a global pruning budget, SHAPE further introduces a quality-coverage selection rule that retains, in each layer, the minimal expert subset covering an $α$ fraction of non-negative Shapley mass, while using bisection to match a target keep rate. Experiments on three modern MoE backbones (Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite) across diverse benchmarks show that SHAPE consistently improves robustness over global and layer-wise pruning variants, maintaining competitive accuracy under 20% and 40% expert pruning without additional training and delivering clear reductions in peak GPU memory footprint. The open-source code is available at https://github.com/Alizen-1009/Shapley-Moe.
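
The selection rule in this abstract can be sketched in a few lines. This is a reading of the abstract's wording, not the released code: the per-expert Shapley values are taken as given inputs, negative values are clipped to zero mass, and the helper names `select_layer` and `match_keep_rate` are hypothetical.

```python
from typing import List, Tuple


def select_layer(values: List[float], alpha: float) -> List[int]:
    """Return indices of the smallest expert set covering an alpha fraction
    of the layer's non-negative Shapley mass (always at least one expert)."""
    mass = [max(v, 0.0) for v in values]
    total = sum(mass)
    order = sorted(range(len(values)), key=lambda i: mass[i], reverse=True)
    kept, covered = [], 0.0
    for i in order:
        kept.append(i)
        covered += mass[i]
        if covered >= alpha * total:
            break
    return kept


def match_keep_rate(layers: List[List[float]], target_keep: float,
                    iters: int = 30) -> Tuple[float, List[List[int]]]:
    """Bisect on alpha until the global fraction of kept experts reaches
    target_keep; the kept count is non-decreasing in alpha."""
    n_experts = sum(len(v) for v in layers)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        kept = sum(len(select_layer(v, mid)) for v in layers)
        if kept / n_experts < target_keep:
            lo = mid
        else:
            hi = mid
    return hi, [select_layer(v, hi) for v in layers]


# Two toy layers of six experts each; keep half of all experts.
toy_layers = [
    [0.5, 0.25, 0.125, 0.125, -0.25, 0.0],
    [0.5, 0.25, 0.125, 0.0625, 0.0625, 0.0],
]
alpha, kept_per_layer = match_keep_rate(toy_layers, target_keep=0.5)
```

Sharing one alpha across layers lets each layer keep a different number of experts: a layer whose mass is concentrated in a few experts is pruned harder than a layer with evenly spread mass. If many experts carry zero mass, even alpha = 1 may fall short of the target keep rate, a case this sketch does not handle.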

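2606.09885 2026-06-10 cs.LG stat.ML New submission

TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

Jiangyang He, Shaolin Zhu, Deyi Xiong

AI summary: Proposes the TENP framework, which identifies important experts and applies neuron pruning to the rest, retaining parameters in a trapezoidal pattern. With 40% routed-expert sparsity and an average of 63.76% activated expert parameters, the DeepSeek model loses only 1 point of accuracy and improves by 10% on code generation tasks.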

Abstract

Mixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address these limitations, we propose TENP, a structured Trapezoidal Expert Neuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, reserving model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. Moreover, it outperforms the full-parameter model by 10% on code generation tasks.

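2606.09883 2026-06-10 cs.LG cs.AI New submission

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

Ningyuan Xi, Hao Xu, Hongsheng Xin, Ning Miao

AI summary: Addresses the lack of an optimization signal when reinforcement learning faces zero-reward problems. Proposes TD-Grokking, a training-time decomposition framework that recursively decomposes intractable problems into verifiable subproblems; it outperforms baseline methods on mathematical and medical tasks.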

Abstract

Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at https://anonymous.4open.science/r/TD-Grokking-6567/.

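2606.09882 2026-06-10 cs.CV cs.LG New submission

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

Chong Liu, Luxuan Fu, Xuyu Feng, Zhen Dong, Bisheng Yang

AI summary: Introduces WHU-Infra3D, a multi-modal benchmark dataset covering 53.8 km across three cities. It fuses panoramic imagery with LiDAR point clouds, provides 2D-3D instance association and cross-frame tracking, and supports infrastructure status diagnosis and attribute recognition, filling a gap in datasets for automated maintenance.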

Abstract

The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at https://github.com/WHU-USI3DV/WHU-Infra3D.

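2606.09881 2026-06-10 cs.LG cs.CR cs.CV New submission

Toward Calibrated, Fair, and accurate Deepfake Detection

Ryan Brown, Chris Russell

AI summary: Proposes the Face-Fairness framework, which uses Face-Feature Tuning to achieve fairness in deepfake detection without demographic labels while maintaining or improving overall accuracy.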

Abstract

Deepfake detectors show large performance gaps across demographic groups. Existing fairness approaches require demographic labels, retraining, or sacrifice accuracy. We introduce Face-Fairness (FF), a plug-and-play framework for bias mitigation. Our primary contribution, Face-Feature Tuning (FFT), is the first demographic label-free fairness method demonstrated for deepfake detection: a lightweight calibrator that performs a logit remapping conditioned on frozen face embeddings. We complement FFT with two variants: FF-Max, which maximizes worst-group accuracy when demographics are available, and FF-Discover, which does the same with embedding-discovered groups. Across in-domain and cross-dataset test settings, FF consistently reduces FPR/TPR gaps and improves minimum group accuracy while maintaining (often improving) overall accuracy. The approach is detector-agnostic, adds negligible runtime overhead, and requires no access to identity attributes.

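2606.09880 2026-06-10 cs.LG New submission

Hyperparameter Learning for Latent Factorization of Tensors for Representation Learning to Large-scale Dynamic Weighted Directed Network

Yaqian Zhan, Jialan He, Tianzhu Chen

AI summary: For large-scale dynamic weighted directed networks, proposes a Differential Evolution-based framework for automatic hyperparameter optimization of the latent factorization of tensors model; it learns the regularization parameters automatically, improving prediction accuracy and reducing manual tuning.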

Abstract

Large-scale dynamic weighted directed networks (DWDNs) are widely used to model time-varying interactions among nodes. Latent factorization of tensors (LFT) extracts target knowledge from DWDNs via low-rank embedding. However, similar to many machine learning models, the performance of LFT heavily depends on the selection of hyperparameters. In practice, these parameters are often tuned manually or through grid search, which requires significant computational resources and human effort. Motivated by this challenge, this paper proposes an automated hyperparameter optimization framework based on Differential Evolution (DE) for LFT (DE-LFT). The proposed method integrates DE into the training process of the LFT model to automatically learn optimal regularization parameters $λ_1$, $λ_2$ and $λ_3$. As a result, the model can adaptively search the hyperparameter space and improve prediction accuracy. Experimental results on four real-world datasets demonstrate that the proposed approach achieves lower MAE and RMSE compared with manually tuned baselines while reducing the need for extensive parameter tuning.
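
For readers unfamiliar with Differential Evolution, a minimal DE/rand/1/bin search over three regularization parameters might look like the sketch below. It is not the DE-LFT procedure: the abstract says DE is integrated into LFT training, whereas this sketch treats the model as a black box. The `toy_validation_loss` function is a hypothetical stand-in for "train LFT with these values and measure validation error", and the population size, `f`, and `cr` settings are illustrative choices.

```python
import random
from typing import Callable, List, Sequence, Tuple


def differential_evolution(loss: Callable[[Sequence[float]], float],
                           bounds: List[Tuple[float, float]],
                           pop_size: int = 20, generations: int = 100,
                           f: float = 0.5, cr: float = 0.9,
                           seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `loss` over box `bounds` with the DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or d == j_rand:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][d]
                trial.append(v)
            trial_score = loss(trial)
            if trial_score <= scores[i]:  # greedy one-to-one replacement
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_validation_loss(lams: Sequence[float]) -> float:
    """Hypothetical stand-in objective with its minimum at (0.1, 0.01, 0.05)."""
    target = (0.1, 0.01, 0.05)
    return sum((x - t) ** 2 for x, t in zip(lams, target))


best_lams, best_loss = differential_evolution(toy_validation_loss, [(0.0, 1.0)] * 3)
```

With a real model each loss evaluation is a full training run, so the population size and generation count would be the main cost knobs.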

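2606.09879 2026-06-10 cs.LG New submission

Operator Fusion for LLM Inference on the Tensix Architecture

Qingbo Wu, Ke Li, Wenzhu Wang, Jie Yu, Ruian Zhang, Lili Liu

AI summary: Targets LLM inference bottlenecks on the Tensix architecture with an operator fusion strategy that fuses RMSNorm with matrix multiplication and uses on-chip SRAM and NoC multicast to cut DRAM reads/writes and scheduling overhead. On the Wormhole platform, attention latency drops by up to 37.44% and MLP latency by 15.89%, with numerical consistency staying above 98.75%.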

Comments 11 pages, 5 figures

Abstract

This study addresses on-device inference bottlenecks of Transformer models on Tenstorrent's Tensix architecture and proposes an operator fusion strategy that enhances data locality. RMSNorm is fused with matrix multiplication in self-attention and in the FFN, enabling back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM to significantly reduce DRAM reads/writes of intermediate results and scheduling overhead. To support multi-core parallelism, a NoC-based multicast mechanism is leveraged in which row/column master nodes efficiently distribute inputs and weights across the core mesh, alleviating DRAM bandwidth contention. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP, with up to 7.91% reduction per decoder layer, while Pearson Correlation Coefficient (PCC) remains above 98.75%, confirming significant end-to-end efficiency gains under numerical consistency.
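
The arithmetic behind fusing RMSNorm into the following matrix multiplication can be shown with a small reference computation. This is a plain-Python illustration of the algebra only and makes no claim about the paper's Tensix kernels or any Tenstorrent API: since the normalization factor is a single scalar per input row, it can be applied after the matmul accumulation, so the normalized vector never needs to be written out as an intermediate. Function names and the `eps` value are assumptions.

```python
import math
from typing import List


def rmsnorm_then_matmul(x: List[float], gain: List[float],
                        w: List[List[float]], eps: float = 1e-6) -> List[float]:
    """Unfused reference: materialize the normalized vector, then multiply."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * g for v, g in zip(x, gain)]  # intermediate result
    return [sum(normed[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def fused_rmsnorm_matmul(x: List[float], gain: List[float],
                         w: List[List[float]], eps: float = 1e-6) -> List[float]:
    """Fused variant: one pass over x accumulates both the sum of squares and
    the gain-weighted products; the 1/rms scale is applied once at the end."""
    sq_sum = 0.0
    acc = [0.0] * len(w[0])
    for i, v in enumerate(x):
        sq_sum += v * v
        scaled = v * gain[i]
        for j in range(len(acc)):
            acc[j] += scaled * w[i][j]
    inv_rms = 1.0 / math.sqrt(sq_sum / len(x) + eps)
    return [a * inv_rms for a in acc]


x = [1.0, 2.0, 3.0, 4.0]
gain = [1.0, 0.5, 2.0, 1.5]
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]
unfused = rmsnorm_then_matmul(x, gain, w)
fused = fused_rmsnorm_matmul(x, gain, w)
```

The two results agree up to floating-point rounding, which is the kind of small discrepancy a numerical-consistency check such as the reported PCC is meant to bound.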

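2606.09878 2026-06-10 cs.LG New submission

FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses

Nicholas Saban

AI summary: Proposes FailureScope, which clusters evaluation probes by their cross-model pass/fail patterns and yields stable, interpretable failure taxonomies across three regimes (single-turn benchmarks, multi-turn dialogue, and adversarial agent attacks), enabling efficient task sampling and cross-model failure prediction.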

Abstract

Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model pass/fail patterns (leave-one-model-out, LOMO), and show it yields stable, interpretable failure taxonomies across three regimes usually studied separately: single-turn benchmarks, multi-turn dialogue, and adversarial agent attacks. On 2,664 single-turn tasks across 18 models, taxonomy-conditioned sampling reaches Kendall's tau = 0.81 at 50 tasks (versus 0.34 for random selection), and cross-model failure prediction reaches AUC 0.88. The same primitive recovers interpretable clusters on a 363-task multi-turn corpus and on 630 adversarial agent traces, where it exposes a meta-failure mode: a 73-100 percentage-point gap between LLM-judge ASR and real execution. Cluster cohesion remains strong across all three regimes, which we take as evidence that behavioral clustering is a portable diagnosis primitive that generalizes beyond any single benchmark. We release the pipeline, three annotated corpora, and the cross-regime taxonomies.
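
The Kendall's tau figure quoted in this abstract is a rank-correlation score. A plausible reading, stated here as an assumption, is that it compares the model ranking obtained from a small task sample with the ranking from the full benchmark. The sketch below computes the tau-a variant (ties count as neither concordant nor discordant) on made-up accuracies; the paper may use a different variant.

```python
from typing import Sequence


def kendall_tau_a(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall's tau-a between two paired score lists of equal length."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical accuracies of four models on the full benchmark and on a
# small sampled subset of tasks.
full_scores = [0.9, 0.8, 0.7, 0.6]
subset_scores = [0.85, 0.9, 0.6, 0.7]
tau = kendall_tau_a(full_scores, subset_scores)
```

Here four of the six model pairs keep their order and two swap, so tau is (4 - 2) / 6, about 0.33. A value near 1 means the subset preserves the full-benchmark ranking.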

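2606.09877 2026-06-10 cs.LG cs.CE cs.CL New submission

Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

Juan M. Huerta

AI summary: Proposes a streaming knowledge compilation framework that proactively pins important documents using a materiality signal φ_t, validates an O(√(T log K)) regret bound in the finance and Wikipedia domains, and reveals an LLM-as-judge confound.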

Abstract

LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus -- an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $ϕ_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$, acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an $O(\sqrt{T\log K})$ regret bound where $\varepsilon=\mathbb{E}[|ϕ_t-\hat{ϕ}_t|]$ is the only domain-specific quantity. We instantiate in two domains: finance, where $ϕ_t$ is abnormal stock volatility predicted by a frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles); and Wikipedia, where $ϕ_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity -- showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis -- not absolute QA scores -- is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training -- richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) -- and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.

2606.09876 2026-06-10 cs.LG New submission

Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

Ke Li, Chongzhe Zhang, Zifan Zeng, Feng Liu, Qunli Zhang, Zheng Hu

AI summary: Proposes PCHI, which uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales attention-head outputs, reducing overconfidence while preserving warranted confidence; ECE falls from 21.9% to 9.2%.

Comments 11 pages, 4 figures

Abstract

Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confidence on correct answers. We introduce Probe-Conditioned Head Intervention (PCHI), an inference-time method that uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales downstream attention-head outputs during confidence generation. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to $\texttt{no}$, while a joint intervention across upstream confidence-template tokens reduces ECE from 21.9% to 9.2% and damages only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, though upstream interventions are weaker and more mask-dependent. These results show that verbalized overconfidence can be selectively reduced through conditionally applied internal intervention, partially decoupling the suppression of unwarranted confidence from the loss of warranted confidence.
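
The ECE figures in this abstract refer to expected calibration error. The sketch below implements the standard equal-width binned definition; how the paper maps its binary yes/no confidence field to the probabilities that ECE needs is not stated in the abstract, so the inputs here are simply assumed to be confidences in [0, 1] paired with correctness flags, and the bin count of 10 is a conventional default rather than the paper's setting.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: the sample-weighted mean gap between average confidence
    and empirical accuracy within each equal-width confidence bin."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Four answers all stated at 0.9 confidence, only half of them correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
```

In this toy case every answer falls in one bin with average confidence 0.9 and accuracy 0.5, so the ECE is 0.4. Lowering the stated confidence on the two wrong answers would shrink the gap without touching the correct ones, which is the selective effect the abstract describes.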


2606.09875 2026-06-10 cs.LG cs.AI stat.ML New submission

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

Johanne Medina, Tianyi Zhou, Keivin Isufaj, Aristides Gionis, Sanjay Chawla

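AI summary: Proposes GLU, which quantifies LLM uncertainty by fusing hidden-state geometric entropy (global) with token-level entropy (local), capturing the confident-but-wrong failure mode without additional training.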

Comments 17 pages, 2 figures

Abstract

Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the global uncertainty of LLMs, while treating token-level uncertainty estimation as a local metric. We show that hidden-state geometric entropy (global uncertainty) and token-level entropy (local uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic.


2606.09874 2026-06-10 cs.LG stat.ML New submission

Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection

Guillaume Coulaud, Reza Akbarinia, Florent Masseglia

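AI summary: Studies how the inference stride (disjoint versus overlapping windows) affects reconstruction-based time series anomaly detection and proposes a unified evaluation protocol; experiments show that overlapping windows yield average relative gains of up to 28% and can change method rankings.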

Abstract

Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often hard to compare due to heterogeneous evaluation practices and underspecified inference procedures. In this paper, we revisit reconstruction-based anomaly detection in the univariate offline setting and study the role of the inference stride, which controls whether subsequences are processed as disjoint windows or with overlap. We propose a unified training, tuning, and multi-seed evaluation protocol on the curated TSB-AD benchmark, and study how overlapping inference affects anomaly detection performance for a range of reconstruction models, including PCA-based baselines, DLinear, an AutoEncoder, TimesNet, and Transformer variants. The results show that across all models, overlapping windows yield consistent improvements, with average relative gains of up to +28%, and can alter method rankings. We further analyze variability across datasets, random seeds, and hyperparameter configurations. Finally, we complement the benchmark study with an evaluation on the full UCR archive using localization criteria aligned with sliding-window reconstruction. Overall, our results highlight that reconstruction-based anomaly detection performance depends not only on model architecture and training, but also on inference choices, motivating a clear and reproducible protocol. Our results show that reconstruction-based baselines achieve strong performance on both TSB-AD and UCR benchmarks, supporting them as competitive and practical approaches for univariate time series anomaly detection.
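To make the role of the inference stride concrete, here is a minimal sketch of sliding-window scoring. It assumes per-point scores are obtained by averaging the squared reconstruction error over every window that covers a point; that aggregation rule, the stand-in `mean_reconstructor`, and all names are assumptions for illustration, not the paper's protocol.

```python
def window_starts(n, window, stride):
    """Start indices of length-`window` windows; a final window covers the tail."""
    if n < window:
        raise ValueError("series must be at least one window long")
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # extra window so the last points are scored
    return starts


def pointwise_scores(series, reconstruct, window, stride):
    """Average squared reconstruction error over all windows covering each point."""
    n = len(series)
    total = [0.0] * n
    count = [0] * n
    for s in window_starts(n, window, stride):
        chunk = series[s:s + window]
        recon = reconstruct(chunk)
        for offset, (x, r) in enumerate(zip(chunk, recon)):
            total[s + offset] += (x - r) ** 2
            count[s + offset] += 1
    return [t / c for t, c in zip(total, count)]


def mean_reconstructor(chunk):
    """Hypothetical stand-in model: reconstructs a window as its own mean."""
    m = sum(chunk) / len(chunk)
    return [m] * len(chunk)


series = [0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 0.0]
disjoint = pointwise_scores(series, mean_reconstructor, window=4, stride=4)
overlapping = pointwise_scores(series, mean_reconstructor, window=4, stride=1)
```

With `stride == window` each point is scored once from a single context; with `stride < window` each point is scored from several contexts, which is the inference choice the abstract argues should be reported explicitly.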


2606.09873 2026-06-10 cs.LG cs.AI New submission

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

Aditya Sharma, Christopher J. Pal, Amal Zouaq

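AI summary: Finds that the input and thinking embeddings of reasoning models are highly conical with non-collinear mean directions, and proposes Rotate2Think, a training-free method that estimates a rotation via orthogonal Procrustes analysis and injects a synthetic thinking vector, improving accuracy in 30 of 32 configurations across mathematics, science, and code tasks.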

Abstract

Reasoning models achieve strong performance on challenging tasks by generating explicit intermediate reasoning traces before producing a final answer. Yet the internal structure of representation space when reasoning remains poorly understood: how do a model's hidden representations differ during thinking versus the embeddings of the input prompt, and can this structure be exploited to elicit stronger reasoning at inference time? We show that both input embeddings and thinking embeddings (mean-pooled last-layer hidden states over the prompt and reasoning trace, respectively) exhibit extremely high conicity, with all vectors clustering tightly around a single mean direction. Crucially, these mean input and thinking directions are non-collinear, with thinking embeddings occupying a geometrically distinct region of embedding space across many different models and benchmark tasks. This observation motivates casting the input-to-thinking transition as a rotation problem admitting a closed-form solution via orthogonal Procrustes analysis. We propose Rotate2Think, a training-free method that estimates this rotation from a small set of correctly solved examples and injects the resulting synthetic thinking vector between thinking delimiters at inference time, providing a geometric primer at the onset of the reasoning trace. Evaluated across multiple benchmarks and model families, Rotate2Think improves accuracy in 30 of 32 model-benchmark configurations across mathematics, science, and code tasks, and generalizes zero-shot to multimodal reasoning on MATH-Vision.
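For reference, the closed-form solution mentioned above is the standard orthogonal Procrustes result. The notation below ($X$, $Y$, $R$, $x$, $\hat{t}$) is assumed here for illustration, and the last line is only a guess at how the estimated map would be applied to a new prompt embedding; the abstract does not spell out that step.

```latex
% Rows of X are input embeddings; rows of Y are the matching thinking embeddings.
\[
R^{\star} \;=\; \arg\min_{R^{\top}R = I} \,\lVert XR - Y \rVert_F ,
\qquad
X^{\top}Y = U \Sigma V^{\top} \;\Longrightarrow\; R^{\star} = U V^{\top}.
\]
% Assumed use at inference: a synthetic thinking vector for a new prompt embedding x.
\[
\hat{t} \;=\; R^{\star\top} x .
\]
```

Note that the constraint $R^{\top}R = I$ admits reflections as well as proper rotations; restricting to $\det R = 1$ requires an extra sign correction on the smallest singular direction.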


2606.09872 2026-06-10 cs.LG cs.AI New submission

PatchSTG: Scalable Spatiotemporal Graph Transformers for Traffic Forecasting on Irregular Sensor Networks

Jichao Li, Xuanming Shi

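AI summary: Proposes PatchSTG, which partitions sensors into balanced local patches using geographic information and uses a dual attention encoder that alternates between local and global dependencies, reducing computational complexity from quadratic to near-linear and giving efficient, stable traffic forecasts on irregular sensor networks.

Comments 22 pages, 12 figures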

Abstract

Traffic forecasting is a fundamental component of intelligent transportation systems, yet remains challenging in real-world settings due to irregular sensor distributions and the high computational cost of modeling large-scale spatiotemporal dependencies. In practical traffic networks, sensors are unevenly distributed across regions, leading to non-uniform spatial structures that limit the effectiveness and scalability of existing graph-based and attention-based models. To address these challenges, we propose PatchSTG, a patch-based spatiotemporal graph Transformer designed for efficient forecasting on irregular sensor networks. The key idea is to introduce a hierarchical spatial representation that partitions sensors into balanced, locality-preserving patches based on geographic information. On top of this structure, a dual attention encoder alternates between intra-patch attention for capturing local interactions and inter-patch attention for modeling global dependencies, reducing computational complexity from quadratic to near-linear scaling. We evaluate PatchSTG on real-world traffic data from Rhode Island and additional large-scale datasets. Experimental results demonstrate that the proposed model achieves stable and competitive forecasting performance across multiple horizons, while significantly improving computational efficiency. Ablation studies further validate the effectiveness of spatial partitioning and dual attention in capturing both local and long-range traffic dynamics. These results suggest that patch-based spatiotemporal modeling provides a scalable and effective framework for traffic forecasting under irregular spatial settings.


2606.09871 2026-06-10 cs.CV cs.AI cs.LG New submission

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park

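AI summary: Proposes SD-GRPO, which decomposes long-form outputs into segments and computes per-segment advantages to address GRPO's coarse-grained credit assignment on vision-language tasks; experiments show it outperforms baselines across several long-form generation tasks.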

Abstract

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
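The per-segment z-normalization described above can be sketched as follows. This is a toy illustration, assuming every rollout has the same number of segments and using the population standard deviation with a small epsilon; those choices and all names are assumptions, not details confirmed by the abstract.

```python
from statistics import mean, pstdev


def per_segment_advantages(rewards, eps=1e-8):
    """rewards[i][j] is the reward of rollout i on segment j.

    Each segment column is z-normalized across the rollout group, giving one
    advantage per (rollout, segment) instead of one scalar per rollout.
    """
    n_segments = len(rewards[0])
    advantages = [[0.0] * n_segments for _ in rewards]
    for j in range(n_segments):
        column = [row[j] for row in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][j] = (r - mu) / (sigma + eps)
    return advantages


def scalar_advantages(rewards, eps=1e-8):
    """Baseline for comparison: z-normalize the summed reward of each rollout."""
    totals = [sum(row) for row in rewards]
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


# Two rollouts, two segments: each rollout gets exactly one segment right.
group = [[1.0, 0.0],
         [0.0, 1.0]]
seg_adv = per_segment_advantages(group)
flat_adv = scalar_advantages(group)
```

In this example the summed rewards tie, so the scalar advantages carry no learning signal, while the per-segment advantages still credit each rollout for the segment it got right.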


2606.09869 2026-06-10 cs.LG cs.AI cs.CR New submission

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

Nazmus Shakib Shadin, Xinyue Zhang, Jingyi Wang, Miao Pan

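AI summary: Proposes QSplitFL, a framework that uses a deep Q-network to select the split point dynamically from client hardware metrics (CPU, memory, battery, network latency), addressing split federated learning on heterogeneous devices; a decayed loss-drop reward and committee voting improve convergence speed and accuracy.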

Comments Accepted by ECML-PKDD 2026

Abstract

Federated Learning (FL) combined with Split Learning (SL) is a privacy-preserving paradigm that enables training deep neural networks (DNNs) on resource-constrained devices while reducing overall training cost. However, determining the optimal split point, meaning the layer where the model is divided, remains a critical challenge, especially when clients have heterogeneous hardware capabilities. Fixed split points can overload weak devices and increase the communication and server load, which slows convergence and reduces stability. This paper introduces QSplitFL, a novel capability-aware Deep Q-Network (DQN) framework for optimal split point selection in Split-learning-based Federated Learning (SFL) environments. Unlike existing approaches that rely on high-dimensional model weight representations, QSplitFL employs a lightweight state representation derived directly from client hardware metrics, including CPU utilization, memory, battery level, and network latency. The proposed framework incorporates a decayed loss-drop reward function that prioritizes early convergence, and a committee-based DQN architecture with majority voting to mitigate reward hacking. Extensive experiments on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets using CNN, ResNet50, MobileNetV4, and ConvNeXt architectures demonstrate that our approach achieves better convergence and higher accuracy compared to existing methods, while effectively adapting to heterogeneous device resources. The source code is publicly available at https://github.com/AIPO-Lab/QSplitFL.
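The two mechanisms named above, a decayed loss-drop reward and committee majority voting, might look roughly like the sketch below. The abstract gives no formulas, so the exponential decay form, the `decay` value, the tie-breaking rule, and every function name here are assumptions made purely for illustration; the released repository is the authority on the actual design.

```python
from collections import Counter


def decayed_loss_drop_reward(prev_loss, curr_loss, round_idx, decay=0.9):
    """Assumed form: loss improvement, discounted more heavily in later rounds."""
    return (decay ** round_idx) * (prev_loss - curr_loss)


def committee_vote(q_values_per_member):
    """Each committee member votes for its greedy split point; the majority wins.

    q_values_per_member[m][a] is member m's Q-value for candidate split point a.
    Ties are broken toward the lowest split index (an arbitrary choice here).
    """
    votes = [max(range(len(q)), key=lambda a: q[a]) for q in q_values_per_member]
    counts = Counter(votes)
    best = max(counts.values())
    return min(a for a, c in counts.items() if c == best)


# Three hypothetical committee members scoring three candidate split points.
committee_q = [[0.1, 0.9, 0.2],
               [0.3, 0.8, 0.1],
               [0.9, 0.1, 0.0]]
chosen_split = committee_vote(committee_q)
early = decayed_loss_drop_reward(1.0, 0.8, round_idx=0)
late = decayed_loss_drop_reward(1.0, 0.8, round_idx=5)
```

Under these assumptions the same loss drop earns more reward early in training than late, and a single member with an inflated Q-value cannot override the other two.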


2606.09866 2026-06-10 cs.LG cs.AI New submission

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

Xinrui Chen, Jianhao Zhang, Ou Wu, Di Gao

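AI summary: Proposes DualSelect, a framework that couples task selection with safety-reference selection to preserve safety alignment during fine-tuning, improving Safety Avg. by at least 5.10 points over the strongest baseline.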

Abstract

Fine-tuning safety-aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our diagnostics show that task updates expose different safety constraints, motivating joint selection of relevant references and compatible task samples. We propose DualSelect, a coupled framework for task and reference selection that refreshes task-conditioned safety references before filtering whole task samples compatible with the induced reference direction. Under a minimax view, DualSelect selects safety references with high preservation loss and task conflict, together with compatible task samples, through entropy-regularized scoring surrogates, lazy reference refresh, and gradient correction. On 1B-8B LLMs, DualSelect preserves safety without losing task utility; using the REDORCA judge, it improves Safety Avg. over the strongest baseline by at least 5.10 points and remains highest in Safety Avg. across judges with moderate overhead. This view extends to retention-focused continual learning.


2606.09865 2026-06-10 cs.LG cs.CR cs.IR New submission

LLM-as-a-Discriminator: When Synthetic Tables Still Look Real

Manel Slokom, Malek Slokom, Thierno Kante

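AI summary: Proposes using an LLM to distinguish real from synthetic tabular data, tests different settings and models, and finds that LLM discrimination can serve as a practical privacy audit signal.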

Abstract

Privacy and data sharing are often in tension. Many organizations use synthetic data to reduce privacy risk while still sharing useful data. For tabular data, auditing privacy remains hard. In many cases, even humans cannot easily tell if a table is real or synthetic. In this paper, we propose a method based on LLM discrimination. We ask an LLM to classify each table sample as REAL or SYNTHETIC. We test two settings: C1 with table only, and C2 with table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In our experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. We collect 451 valid trials. Our results show clear differences between models. On Adult, LLaMA reaches DRS=0% in reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare with a classifier two-sample test (C2ST) and record linkage as distributional baselines, and with a human pilot of 2 annotators and 240 trials. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per-provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at https://github.com/SlokomManel/LLM-as-a-Discriminator.


2606.09864 2026-06-10 cs.LG cs.AI cs.ET New submission

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou

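AI summary: Finds that low-bit KV cache quantization can silently destroy LLM safety alignment because safety features lie in a low-dimensional activation subspace that is vulnerable to quantization noise; proposes Per-Channel Reduction (PCR), a diagnostic that classifies three failure modes and guides mitigation, recovering up to 97% of lost alignment without training.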

Comments Preprint. 61 pages, 9 figures

Abstract

Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations focus solely on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace $10^2$-$10^3\times$ more vulnerable to quantization noise than the full representation space that perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.
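The "outlier-crushes-safety" mode above turns on how a quantization scale factor is chosen. The toy sketch below, which is not PCR itself, shows the general effect with symmetric 4-bit quantization: when one scale is shared with an outlier channel, a small-magnitude channel collapses to zero, whereas giving that channel its own scale preserves it. The channel values and helper names are hypothetical.

```python
def scale_for(values, bits=4):
    """Symmetric scale so the largest magnitude maps to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, bits=4):
    """Round to signed integer levels with the given scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        level = max(-qmax, min(qmax, round(v / scale)))
        out.append(level * scale)
    return out


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Channel with large outliers versus a channel with a small-magnitude signal.
outlier_channel = [40.0, -35.0, 38.0]
small_channel = [0.9, -0.7, 0.8]

# One scale shared by both channels is driven entirely by the outliers.
shared_scale = scale_for(outlier_channel + small_channel)
small_with_shared = quantize(small_channel, shared_scale)

# A separate scale for the small channel keeps its values distinguishable.
own_scale = scale_for(small_channel)
small_with_own = quantize(small_channel, own_scale)

shared_error = max_abs_error(small_channel, small_with_shared)
own_error = max_abs_error(small_channel, small_with_own)
```

The abstract's second mode (outlier-as-safety) is the case this fix cannot help, since there the information sits in the outlier channels themselves.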


2606.09862 2026-06-10 cs.LG cs.AI New submission

Blurry Window Attention

Axel Laborieux, Christos Sourmpis, Juan Gabriel Kostelec, Qinghai Guo

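AI summary: Proposes Blurry Window Attention (BLA), a bounded-memory-control attention method that reconstructs a blurry KV history via Dirichlet-kernel interpolation; on synthetic tasks its state efficiency is 8x that of sliding window attention, and its performance improves as the state size grows.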

Abstract

The Softmax Attention operation in Transformer language models has quadratic complexity in the sequence length and a state size that grows in the form of a KV cache, which becomes a bottleneck in long-context scenarios. To overcome this limitation, alternative architectures with linear complexity and finite state size have been introduced, such as State-Space Models (SSMs), Linear Attention (LA), and Attention with Bounded-memory Control (ABC). Though linear models achieve language perplexity similar to that of Transformers, they still lag behind on tasks that require retrieval or recall of specific information. In this work, we introduce Blurry Window Attention (BLA), a novel ABC method inspired by SSMs. BLA stores a frequency window from which a blurry KV history is reconstructed via interpolation using Dirichlet kernels. BLA can be understood as a generalization of Sliding Window Attention (SWA), depending on the resolution of the Dirichlet kernels, or as a special case of Gated Slot Attention (GSA) in which the decay factor is implemented with Dirichlet kernels. We describe in detail the theory and efficient implementation of BLA. On the Multi-Query Associative Recall (MQAR) synthetic task, we show that the state efficiency of BLA is 8$\times$ better than that of SWA and competitive with popular linear attention models; on the RegBench synthetic task, among the linear models we tested, only BLA and SWA improve as the state size grows.


2606.09861 2026-06-10 cs.LG cs.AI New submission

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

Yunhao Zhang, Ruiying Qi, Jiale Zheng, Jianfeng Zhang, Lujia Pan, Junchi Yan

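AI summary: Proposes UniTok, a universal tokenizer that converts time series into discrete tokens, and UniTok-FM, a foundation model pretrained on those tokens via NTP, supporting zero-shot forecasting, prompt-boosted forecasting, and few-shot generation and classification without task-specific modifications.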

Abstract

While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok-FM, a foundation model pretrained via NTP on these tokens. UniTok-FM is a general-purpose foundation model that supports zero-shot and prompt-boosted forecasting, as well as few-shot generation and classification via training-free in-context inference -- a capability not achieved by prior work. Technically, UniTok is a vector-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive-resolution causal architecture for encoding and decoding, and a structure-preserving reconstruction loss for training. UniTok-FM adopts an off-the-shelf LLM architecture without TS-specific modifications. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics. Experiments on forecasting, generation, and classification show that a single unified UniTok-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task-specific foundation models, and uniquely enables training-free in-context inference across tasks.


2606.09860 2026-06-10 cs.LG cs.AI stat.AP stat.ML New submission

Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages

Xinze Zhang

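AI summary: Proposes Method, a machine-learning framework that combines gradient-boosted decision trees with conformal prediction to give distribution-free, calibrated coverage for individual NAFLD risk predictions; on a multicenter cohort from China it reaches an AUROC of 0.912, outperforming several alternative models.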

Abstract

Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction coupling gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. It integrates a mutual-information-based stability selection procedure to identify a compact, clinically interpretable feature subset via bootstrap resampling, constructing prediction sets whose marginal coverage provably exceeds a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features across demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk subgroup showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features -- notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI -- align with established metabolic risk factors, providing biological plausibility.
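The coverage guarantee mentioned above is the kind provided by split conformal prediction. Below is a minimal sketch for a binary classifier, assuming the nonconformity score is one minus the predicted probability of the true label; that score, the toy calibration data, and the function names are assumptions, since the abstract does not state which conformal variant is used.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Finite-sample-corrected quantile of calibration scores 1 - p(true label)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return 1.0  # too few calibration points: every label must be included
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the calibrated threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration set: model probability assigned to the true label.
true_label_probs = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.3]
cal_labels = [i % 2 for i in range(len(true_label_probs))]
cal_probs = [[p, 1 - p] if y == 0 else [1 - p, p]
             for p, y in zip(true_label_probs, cal_labels)]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
confident_case = prediction_set([0.9, 0.1], threshold)
uncertain_case = prediction_set([0.55, 0.45], threshold)
```

A confident prediction yields a single-label set, while an uncertain one yields both labels; under exchangeability of calibration and test points, sets built this way contain the true label with probability at least 1 - alpha on average.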


2606.09856 2026-06-10 cs.CL cs.AI cs.LG stat.ML New submission

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Liyi Zhang, Akshay K. Jagadish, Brenden M. Lake, Thomas L. Griffiths

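AI summary: Proposes Program-based Posterior Training (PPT), which uses an LLM to generate scenarios as probabilistic programs, runs inference to produce distributional targets, and fine-tunes the model to improve inductive-reasoning accuracy, alignment with human judgments, and calibration.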

Comments 20 pages, 5 figures

Abstract

Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.
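To illustrate what a distributional target produced by probabilistic inference looks like, here is a deliberately tiny example: a hand-written generative model of a coin with unknown bias, with the posterior computed exactly by enumeration. The scenario, the candidate biases, and the function name are hypothetical; in the paper the programs are LLM-generated and far more varied.

```python
def posterior_over_bias(flips, biases=(0.2, 0.5, 0.8)):
    """Exact posterior over a coin's bias under a uniform prior, by enumeration.

    flips is a list of 0/1 outcomes (1 = heads). The returned dict is the
    soft label: one probability per candidate bias, summing to 1.
    """
    heads = sum(flips)
    tails = len(flips) - heads
    likelihoods = [b ** heads * (1 - b) ** tails for b in biases]
    total = sum(likelihoods)
    return {b: w / total for b, w in zip(biases, likelihoods)}


observations = [1, 1, 0, 1]  # sparse, ambiguous evidence
soft_label = posterior_over_bias(observations)
most_likely_bias = max(soft_label, key=soft_label.get)
```

The point of the soft label is that it keeps the residual uncertainty: four flips favor the heads-biased coin without ruling out the fair one, and a fine-tuning target of this form rewards the model for expressing that.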


2606.09854 2026-06-10 cs.CL cs.AI cs.CY cs.LG New submission

Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

Juergen Dietrich

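AI summary: Studies whether model families can be identified from stylometry in multi-agent LLM political analysis, introduces the SD-CV protocol, and shows that a T5 model reaches F1 = 0.991 on a five-class attribution task, demonstrating that prompt-level anonymization cannot neutralize model identity signals.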

Comments 24 pages, 3 figures

Abstract

Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs - raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches - LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model - on a five-class attribution task covering four commercial LLM families and an open-world 'unknown' class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements - robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p<0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.
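The statement-disjoint idea above amounts to grouped cross-validation keyed on the source statement. The sketch below assumes a simple round-robin assignment of statements to folds; the paper's exact SD-CV procedure is defined in its Section 3.5 and may differ, and the function name and toy IDs are illustrative.

```python
def statement_disjoint_folds(statement_ids, k):
    """Split text indices into k (train, val) pairs with no shared statements.

    statement_ids[i] is the ID of the political statement that text i analyzes.
    Folds are assigned per statement, so all texts about one statement land
    on the same side of every split.
    """
    unique = sorted(set(statement_ids))
    fold_of_statement = {sid: i % k for i, sid in enumerate(unique)}
    folds = []
    for fold in range(k):
        val = [i for i, sid in enumerate(statement_ids)
               if fold_of_statement[sid] == fold]
        train = [i for i, sid in enumerate(statement_ids)
                 if fold_of_statement[sid] != fold]
        folds.append((train, val))
    return folds


# Six texts: three statements, each analyzed by two different models.
statement_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = statement_disjoint_folds(statement_ids, k=3)
```

Splitting by text rather than by statement would let the classifier see the same content in training and validation, so a high score could reflect topic memorization instead of model style; grouping by statement removes that shortcut.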
