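arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.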

2510.02174 2026-05-28 cs.LG math.OC math.PR stat.ML

Flatness-Aware Stochastic Gradient Langevin Dynamics

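Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim

AI summary: Proposes flatness-aware stochastic gradient Langevin dynamics (fSGLD), which couples a theoretically prescribed noise scale with the inverse temperature to bias the dynamics toward flat basins while retaining computational efficiency, and provides a non-asymptotic theoretical analysis with experimental validation.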

Comments Accepted by ICML 2026

Journal ref: ICML 2026

Abstract

Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), a first-order optimization method that biases its learning dynamics toward flat basins while retaining the computational and memory efficiency of SGD and SGLD. We provide a non-asymptotic theoretical analysis showing that fSGLD targets a flatness-biased Gibbs distribution under a theoretically prescribed coupling between the noise scale $σ$ and the inverse temperature $β$, together with explicit excess risk guarantees. We empirically evaluate fSGLD across standard optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection, demonstrating consistently strong performance and reliable uncertainty estimates. Additional experiments confirm the effectiveness of the theoretically prescribed $β$-$σ$ coupling compared to decoupled choices.
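
For orientation, the plain SGLD baseline mentioned in the abstract can be sketched in a few lines. This is a minimal sketch of standard SGLD only, assuming the usual update θ ← θ − η·g(θ) + sqrt(2η/β)·ξ with ξ standard Gaussian. It does not implement the flatness-aware modification or the β-σ coupling, whose exact form the abstract does not give. The names `sgld_step`, `run_sgld`, and `quadratic_grad` are hypothetical.

```python
import math
import random


def sgld_step(theta, grad, step, beta, rng):
    """One plain SGLD update (not fSGLD): gradient step plus Gaussian noise.

    The noise standard deviation sqrt(2 * step / beta) is the standard
    Langevin scaling for inverse temperature beta.
    """
    noise_scale = math.sqrt(2.0 * step / beta)
    g = grad(theta)  # stands in for a stochastic gradient estimate
    return [t - step * gi + noise_scale * rng.gauss(0.0, 1.0)
            for t, gi in zip(theta, g)]


def run_sgld(theta0, grad, step, beta, n_steps, seed=0):
    """Iterate sgld_step n_steps times from theta0 with a seeded generator."""
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(n_steps):
        theta = sgld_step(theta, grad, step, beta, rng)
    return theta


def quadratic_grad(theta):
    """Gradient of the toy objective 0.5 * ||theta||^2 (an assumption for the demo)."""
    return list(theta)


if __name__ == "__main__":
    # Large beta means low temperature: the iterates settle near the minimum at 0.
    print(run_sgld([1.0, -2.0], quadratic_grad, step=0.1, beta=1e6, n_steps=500))
```

Raising `beta` shrinks the injected noise, so the chain behaves more like plain gradient descent; lowering it makes the iterates spread out around the minimum.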

2509.23074 2026-05-28 cs.LG cs.AI

Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

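Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

AI summary: Addresses the problem that benchmark-leaderboard evaluation conflates model performance with the data's intrinsic unpredictability by proposing a predictability-aligned diagnostic framework based on spectral coherence, comprising the SCP score and the LUR tool, which reveals predictability drift and an architectural trade-off.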

Abstract

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

2602.01203 2026-05-28 cs.CL cs.LG

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

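Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

AI summary: Shows theoretically and empirically that the attention sink naturally builds a Mixture-of-Experts mechanism within attention layers, and proposes a sink-aware training algorithm that mitigates head collapse and improves model performance.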

Comments 2026 International Conference on Machine Learning (ICML)

Abstract

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.

2512.14340 2026-05-28 cs.RO

Field evaluation and optimization of a lightweight autonomous lidar-based UAV system based on a rigorous experimental setup in boreal forest environments

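Aleksi Karhunen, Teemu Hakala, Väinö Karjalainen, Eija Honkavaara

AI summary: Proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems, validated with a lightweight lidar-based quadrotor over 93 real-world flights in boreal forests; the optimized system achieved success rates of 12/15 and 15/15 at 1 m/s and 2 m/s in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest.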

Comments This work has been submitted to the IEEE for possible publication

DOI: 10.1109/TFR.2026.3691711

Abstract

Interest in utilizing autonomous uncrewed aerial vehicles (UAVs) for under-canopy forest remote sensing has increased in recent years, resulting in the publication of numerous autonomous flight algorithms in the scientific literature. To support the selection and development of such algorithms, a reliable comparison of existing approaches based on published studies is essential. However, reliable comparisons are currently challenging due to widely varying experimental setups and incomplete reporting practices. This study proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems to fill this gap. The proposed setup emphasizes quantitative reporting of forest complexity, visual representation of test environments, execution of multiple repeated flights, and reporting of flight success rates alongside qualitative flight results. In addition, flights at multiple target speeds are encouraged, with reporting of realized flight speed, mission completion time, and point-to-point flight distance. The proposed setup is demonstrated using a lightweight lidar-based quadrotor employing state-of-the-art open-source algorithms, evaluated through extensive experiments in two natural boreal forest environments. Based on a systematic evaluation of the original system, several improvements were introduced. The same experimental protocol was then repeated with the optimized system, resulting in a total of 93 real-world flights. The optimized system achieved success rates of 12/15 and 15/15 at target flight speeds of 1 m/s and 2 m/s, respectively, in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest. Adoption of the proposed experimental setup would facilitate the literature-based comparison of autonomous under-canopy flight systems and support systematic performance improvement of future UAV-based forest robotics solutions.

2601.23262 2026-05-28 cs.LG

Particle-Guided Diffusion Models for Partial Differential Equations

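Andrew Millard, Fredrik Lindsten, Zheng Zhao

AI summary: Proposes a particle-guided stochastic sampling method that combines diffusion models with physics-based guidance from PDE residuals and observation constraints, yielding a scalable generative PDE solver within a sequential Monte Carlo framework, with lower numerical error than existing methods across multiple benchmarks and multiphysics systems.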

2510.08525 2026-05-28 cs.CL

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

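Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

AI summary: Proposes RLKV, which uses reinforcement learning to identify the attention heads critical to reasoning quality, keeps a full KV cache for those heads and aggressively compresses the others, achieving a 20-60% cache reduction with near-lossless performance.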

Abstract

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20-60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.

2507.16679 2026-05-28 cs.CL cs.AI cs.CY

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

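Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

AI summary: To address the instruction bottleneck caused by value conflicts in in-context alignment, proposes PICACO, which optimizes a meta-instruction by maximizing the total correlation between the specified values and model responses, achieving balanced pluralistic value alignment without fine-tuning.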

Comments ICML 2026

Abstract

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions: human values are inherently pluralistic and often impose conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
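
The quantity being maximized can be made concrete with the standard definition of total correlation for discrete variables, TC(X_1, ..., X_n) = Σ_i H(X_i) − H(X_1, ..., X_n). The sketch below computes it from an explicit joint distribution. How PICACO estimates this quantity for values and LLM responses is not described in the abstract, so this illustrates the definition only; the function names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def total_correlation(joint):
    """Total correlation of a discrete joint distribution.

    `joint` maps outcome tuples (x1, ..., xn) to probabilities summing to 1.
    Returns the sum of marginal entropies minus the joint entropy.
    """
    n = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(n)]
    for outcome, p in joint.items():
        for i, x in enumerate(outcome):
            marginals[i][x] += p
    return sum(entropy(m) for m in marginals) - entropy(joint)


if __name__ == "__main__":
    independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    coupled = {(0, 0): 0.5, (1, 1): 0.5}
    print(total_correlation(independent), total_correlation(coupled))
```

Total correlation is zero exactly when the variables are independent and grows as they share more information, which is why maximizing it between values and responses rewards responses that track the specified values.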

2601.21666 2026-05-28 cs.AI cs.CV

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

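Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

AI summary: Introduces the SONIC-O1 benchmark, comprising 60 hours of human-verified audio-video data, to evaluate multimodal large language models on open-ended summarization, multiple-choice question answering, and temporal localization; finds substantial performance gaps and demographic disparities on temporal localization.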

Abstract

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2601.21167 2026-05-28 cs.LG

Learning What to Recommend: Minimax Optimal Simple Regret in Logistic Bandits

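Shuai Liu, Alireza Bakhtiari, Alex Ayoub, Botao Hao, Csaba Szepesvári

AI summary: For stochastic logistic bandits under the simple-regret objective, proposes two curvature-aware algorithms (MULog and THATS), obtains a regret upper bound that matches the lower bound, and shows that κ_*, the inverse slope of the sigmoid at the optimal action, governs the minimax difficulty.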

Abstract

We study stochastic logistic bandits with $d$-dimensional action features under the simple-regret objective, where a learner uses $T$ rounds of exploration to output a single final action. The logistic structure is essential here: because the informativeness of an action depends on the local curvature of the sigmoid, actions that are best for immediate reward need not be the most useful for identifying the best final recommendation. We show that the first-order minimax difficulty is governed by $κ_*$, the inverse slope of the sigmoid at the optimal action. The lower bound is realized by a shifted saturated hard family in which saturation simultaneously limits the information available about the final decision and controls the value loss from a wrong recommendation. This reveals a hard mechanism distinct from cumulative-regret constructions, even though online-to-batch reductions recover the same leading order in expectation. We then develop two curvature-aware algorithms: MULog, a pure-exploration method whose final recommendation satisfies a high-probability upper bound of order $\tilde O(d/\sqrt{κ_* T})$, matching the lower bound up to logarithmic factors, and THATS, a Thompson-sampling-style method that provides a computationally lighter alternative. Experiments on both hard and easy geometries support the same picture: informative low-reward actions can make instances substantially easier, and the curvature-aware methods exploit this structure especially effectively.
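
A small numeric sketch may help fix ideas. It assumes the standard logistic function σ(z) = 1/(1 + e^(−z)), reads "inverse slope" as κ = 1/σ'(z) with σ'(z) = σ(z)(1 − σ(z)), and evaluates the stated rate d/√(κ_* T) with logarithmic factors and constants dropped. The function names are hypothetical and not taken from the paper.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def inverse_slope(z):
    """kappa = 1 / sigmoid'(z), where sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return 1.0 / (s * (1.0 - s))


def simple_regret_rate(d, kappa_star, horizon):
    """Leading-order rate d / sqrt(kappa_star * T), ignoring log factors and constants."""
    return d / math.sqrt(kappa_star * horizon)


if __name__ == "__main__":
    for z in (0.0, 2.0, 4.0):
        kappa = inverse_slope(z)
        print(z, kappa, simple_regret_rate(d=10, kappa_star=kappa, horizon=10_000))
```

The sigmoid is steepest at z = 0, where κ takes its minimum value of 4, and κ grows quickly as the optimal action moves into the saturated region, which shrinks the rate above.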

2510.11234 2026-05-28 cs.LG

Neural Weight Compression for Language Models

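Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

AI summary: Proposes Neural Weight Compression (NWC), a framework that trains neural codecs on datasets of pretrained weights, addressing tensor heterogeneity and the mismatch between reconstruction loss and downstream performance, and achieving strong accuracy-compression trade-offs in the 4-6 bit regime.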

Abstract

Efficient compression of language model weights is increasingly critical as model scale and deployment grow. Yet, most existing methods rely on handcrafted transforms and heuristics, reflecting the limited understanding of weights as a data modality. To move beyond this paradigm, we formulate weight compression as neural codec learning and propose Neural Weight Compression (NWC), a framework for training neural codecs on pretrained weight datasets. NWC addresses challenges intrinsic to weight compression, including tensor heterogeneity and the mismatch between reconstruction losses and downstream performance. Experiments show that NWC achieves highly competitive accuracy-compression tradeoffs, with particularly strong results in the 4-6 bit regime, without relying on rigid handcrafted components such as the Hadamard transform. These gains extend across diverse architectures, e.g., vision encoders. Our analysis highlights the roles of entropy-constrained quantization and learned transforms in adapting compression to weight data and downstream tasks.

2601.19926 2026-05-28 cs.CL cs.AI

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

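Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

AI summary: A systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs) finds that TLMs encode non-trivial syntactic knowledge but perform more weakly on syntax-semantics interface phenomena, and that research is concentrated on English and BERT-like models.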

Abstract

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2601.08131 2026-05-28 cs.CL

Attention Projection Mixing with Exogenous Anchors

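Jonathan Su

AI summary: To resolve the structural conflict in internal-anchor designs for cross-layer reuse of early attention projections, proposes ExoFormer, which learns exogenous anchor projections outside the sequential layer stack and introduces a unified normalized mixing framework, improving downstream accuracy while using fewer tokens.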

Abstract

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

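Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

AI summary: Proposes Mask-GCG, which uses learnable token masking to identify high-impact tokens in the suffix and prune low-impact ones, reducing computational overhead while maintaining the attack success rate and revealing token redundancy in LLM prompts.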

Comments Accepted to ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11462363
Journal ref: 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026

Abstract

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.17737 2026-05-28 cs.CV cs.AI

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

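Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

AI summary: Proposes an end-to-end agentic framework that trains ScripterAgent to turn dialogue into a fine-grained script and uses DirectorAgent's cross-scene continuous generation strategy to produce coherent long-horizon dialogue-to-cinematic video, significantly improving script faithfulness and temporal fidelity.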

Abstract

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2601.18116 2026-05-28 cs.CL

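An Analytic Theory of Convolutional Neural Network Inverse Problem Solvers

Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss

AI summary: Introduces the inductive biases of translation equivariance and finite receptive fields into the minimum mean square error estimator, derives an analytic formula for the local-equivariant MMSE, and verifies that it closely matches neural network outputs across multiple inverse problems, datasets, and architectures.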

Journal ref: Forty-Third International Conference on Machine Learning, 2026

Abstract

Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim 25$ dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
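
As background for the constrained estimator, the unconstrained MMSE estimator under an empirical training distribution has a well-known closed form in the Gaussian denoising case: with y = x + n, n ~ N(0, σ²I), and x uniform over the training points x_i, the posterior mean is Σ_i w_i x_i with w_i ∝ exp(−‖y − x_i‖² / (2σ²)). The sketch below implements only this unconstrained baseline; it does not include the equivariance and locality constraints that define LE-MMSE, and the function name is hypothetical.

```python
import math


def empirical_mmse_denoise(y, train, sigma):
    """Posterior mean E[x | y] for y = x + N(0, sigma^2 I), x uniform over `train`.

    y is a list of floats; train is a list of equally long lists.
    Weights are a softmax of -||y - x_i||^2 / (2 sigma^2), computed stably.
    """
    log_w = [-sum((yk - xk) ** 2 for yk, xk in zip(y, x)) / (2.0 * sigma ** 2)
             for x in train]
    shift = max(log_w)  # subtract the max before exponentiating to avoid underflow
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [sum(wi * x[k] for wi, x in zip(w, train)) / total
            for k in range(len(y))]


if __name__ == "__main__":
    train = [[0.0, 0.0], [1.0, 1.0]]
    print(empirical_mmse_denoise([0.9, 1.1], train, sigma=0.1))
    print(empirical_mmse_denoise([0.5, 0.5], train, sigma=0.1))
```

At low noise the estimate snaps to the nearest training point, and an observation equidistant from two points returns their average.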

2601.10085 2026-05-28 cs.CL

CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

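Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Michael L. Birnbaum, Munmun De Choudhury

AI summary: Proposes CALM-IT, a framework that generates and evaluates long-form Motivational Interviewing dialogues by explicitly modeling the evolution of client and counselor states; on a corpus of 8,232 synthetic dialogues it outperforms baselines, with the best results on most MITI 4.2 global ratings and on client acceptance rate.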

Comments 53 pages, in submission to EMNLP

Abstract

Therapeutic dialogue is not a sequence of isolated responses: client goals, motivation, resistance, and therapeutic alliance evolve over time. Yet current LLM-based mental health dialogue systems often lack explicit mechanisms for tracking these dynamics across extended interactions, which can lead to poorly timed interventions or premature goal resolution. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing dialogues through explicit modeling of evolving client and counselor states, guiding both counseling strategy selection and utterance generation. We evaluate CALM-IT on a large-scale corpus of 8,232 synthetic dialogues spanning multiple dialogue lengths and frameworks. Compared with all baselines, CALM-IT achieves the best performance on most MITI 4.2 global ratings, including Empathy, Partnership, and Softening Sustain Talk, as well as on other key performance metrics while exhibiting minimal performance degradation as dialogue length increases. Notably, although CALM-IT initiates fewer change-directed prompts, it produces the highest client acceptance rate (64.3%) on average across different length conditions. We release a reproducible generation framework, a MITI-grounded process-level evaluation protocol, and a large-scale synthetic corpus for studying therapeutic LLMs under realistic long-form interaction conditions.


2601.08617 2026-05-28 cs.CV

SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

SoC: 测试时提示调优的语义正交校准

Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz

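AI summary: Addresses the neglect of calibration in test-time prompt tuning of vision-language models by proposing SoC, a Huber-based regularizer that achieves smooth prototype separation while preserving semantic proximity, improving calibration while maintaining discriminative ability.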

Journal ref: CVPR 2026

AI中文摘要

随着视觉语言模型（VLM）在医疗或自动驾驶等关键决策系统中的日益普及，对其不确定性估计的校准变得至关重要。然而，这一维度在VLM测试时提示调优（TPT）文献中尚未得到充分探索，该领域主要侧重于提升其判别性能。最近的最先进方法主张对文本提示嵌入对实施完全正交约束以增强可分离性，从而改善校准。然而，正如我们在理论上所示，完全正交约束的固有梯度会强烈地将语义相关的类别推开，最终使模型过度自信。基于我们的发现，我们提出了语义正交校准（SoC），一种基于Huber的正则化器，它在保持语义邻近性的同时实现平滑的原型分离，从而相比于先前的基于正交性的方法改善了校准。通过全面的实证验证，我们证明SoC在保持竞争性判别能力的同时，持续改善了校准性能。

英文摘要

With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints strongly push semantically related classes apart, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Through a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance while maintaining competitive discriminative capabilities.
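
As a rough illustration of the gradient argument in this abstract, the sketch below contrasts a squared penalty on the similarity of two prompt embeddings with a generic Huber penalty. The abstract does not give the exact form of the SoC regularizer, so the function names, the `delta` threshold, and the choice to penalize a scalar similarity `s` directly are all assumptions made for illustration only.

```python
import math


def squared_penalty_grad(s: float) -> float:
    """Gradient of the squared penalty s**2 with respect to the similarity s.

    It grows linearly with s, so the most similar pairs are pushed hardest.
    """
    return 2.0 * s


def huber(s: float, delta: float = 0.5) -> float:
    """Standard Huber function: quadratic near zero, linear beyond delta."""
    a = abs(s)
    if a <= delta:
        return 0.5 * s * s
    return delta * (a - 0.5 * delta)


def huber_grad(s: float, delta: float = 0.5) -> float:
    """Gradient of the Huber function; its magnitude never exceeds delta."""
    if abs(s) <= delta:
        return s
    return math.copysign(delta, s)


# Hypothetical similarities: a weakly related pair and a strongly related pair.
for s in (0.2, 0.9):
    print(s, squared_penalty_grad(s), huber_grad(s))
```

Under these assumptions, the Huber gradient for the strongly related pair is capped at `delta`, whereas the squared penalty's gradient keeps growing with similarity. This is one plausible reading of how a Huber-style term could separate prototypes smoothly without pushing semantically related classes apart too aggressively.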


2601.07648 2026-05-28 cs.CL

What Are We Measuring in NLG? A Meta-Analysis of Evaluation Trends 2020-2025

我们在NLG中测量什么？2020-2025年评估趋势的元分析

Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck, David Schlangen, Sebastian Möller, Vera Schmitt

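AI summary: A meta-analysis of 14,171 papers reveals three systemic problems in NLG evaluation (metric inertia, the metric-criteria mapping problem, and the validation gap) and proposes a minimal evaluation checklist.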

Comments 8 pages

AI中文摘要

随着自然语言生成（NLG）主导现代NLP，可扩展评估仍然是一个关键瓶颈。因此，LLM作为评判者（LaaJ）的采用迅速加速，在2025年出现在比人工评估更多的论文中。这一关键转变促使对当前评估实践进行批判性分析。克服刚性关键词过滤和人工审查的限制，我们采用多LLM信息提取管道，从四大NLP会议（2020-2025）的14,171篇论文中收集结构化元数据。分析3,334篇过滤后的NLG论文，我们识别出三个系统性问题。（1）度量惯性：尽管转向开放式生成，传统词汇度量（BLEU、ROUGE）仍作为主要指标持续存在，通常与语义替代方案一起使用而非被取代。（2）度量-标准映射问题：我们的论文级共现数据显示，通用自动度量被用作质量的广泛代理，而未指定它们旨在评估文本生成的哪个维度。（3）验证差距：LaaJ在没有相应人工验证的情况下快速增长（少于8%的论文）。关键的是，虽然LaaJ与整体质量相关，但在流畅性等细粒度标准上一致性崩溃。为弥补这些差距，我们将发现提炼为一个最小评估清单，以指导度量选择、构念效度和LaaJ部署。

英文摘要

As Natural Language Generation (NLG) dominates modern NLP, scalable evaluation remains a critical bottleneck. Consequently, LLM-as-a-judge (LaaJ) adoption has accelerated rapidly, appearing in more papers than human evaluation in 2025. This pivotal shift motivates a critical analysis of current evaluation practices. Overcoming the limits of rigid keyword filtering and manual review, we employ a multi-LLM information extraction pipeline to gather structured metadata from 14,171 papers across four major NLP conferences (2020-2025). Analyzing 3,334 filtered NLG papers, we identify three systemic challenges. (1) Metric inertia: despite the shift toward open-ended generation, legacy lexical metrics (BLEU, ROUGE) persist as primary indicators, typically used alongside rather than replaced by semantic alternatives. (2) Metric-criteria mapping problem: our paper-level co-occurrence data reveals that general-purpose automatic metrics are applied as broad proxies for quality, without specifying which dimension of text generation they are intended to evaluate. (3) Validation gap: LaaJ has grown rapidly without commensurate human validation (fewer than 8% of papers). Crucially, while LaaJ correlates with aggregate quality, alignment collapses on fine-grained criteria like fluency. To address these gaps, we distill our findings into a minimal Evaluation Checklist to guide metric selection, construct validity, and LaaJ deployment.


2601.06329 2026-05-28 cs.CL cs.AI

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

论口语语言模型评估中全局令牌困惑度的谬误

Chan-Jan Hsu, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso

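AI summary: Addresses the practice, in spoken language model evaluation, of computing speech-token perplexity by directly applying the text perplexity formulation; proposes new likelihood-based and generation-based evaluation methods that reflect generation quality more faithfully and narrow the gap between the best model and the human topline.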

AI中文摘要

在大规模原始音频上预训练的生成式口语语言模型能够以适当内容继续语音提示，同时保留说话人和情感等属性，作为口语对话的基础模型。在先前文献中，这些模型通常使用“全局令牌困惑度”进行评估，该指标直接将文本困惑度公式应用于语音令牌。然而，这种做法忽略了语音和文本模态之间的根本差异，可能导致对语音特性的低估。在这项工作中，我们提出了多种基于似然和生成的评估方法，以替代朴素的全局令牌困惑度。我们证明，所提出的评估更忠实地反映了感知生成质量，与人类评分的平均意见得分（MOS）具有更强的相关性。在新指标下评估时，口语语言模型的相对性能格局被重塑，揭示了最佳性能模型与人类基线之间的差距显著缩小。总之，这些结果表明，适当的评估对于准确评估口语语言建模的进展至关重要。

英文摘要

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between the speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood-based and generation-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
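
For readers unfamiliar with the text perplexity formulation this abstract refers to, here is a minimal sketch of the standard definition: the exponential of the negative mean per-token log-probability. The function name `perplexity` and the example inputs are hypothetical, and the abstract does not specify how the paper aggregates tokens in its "global" variant or how its proposed alternatives are computed, so none of that is shown.

```python
import math
from typing import Sequence


def perplexity(log_probs: Sequence[float]) -> float:
    """Standard perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum_i log p(x_i | x_<i))
    """
    if not log_probs:
        raise ValueError("perplexity needs at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))


# Hypothetical sequence: a model that assigns probability 1/4 to every token.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

A model that spreads probability uniformly over four choices at every step has perplexity 4 under this definition, regardless of whether the tokens are text or discrete speech units, which is why applying the formula unchanged to speech tokens says nothing by itself about acoustic properties.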


2601.05386 2026-05-28 cs.AI

How Much Can a Few Engine Moves Help? Quantifying Limited Cheating in Chess

几次引擎走棋能有多大帮助？量化国际象棋中的有限作弊

Daniel Keren

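AI summary: Using a threshold strategy and a Bellman strategy, this paper quantifies how a limited number of engine-assisted moves affects a player's score in games played with the Stockfish engine, and introduces an engine-free simulator for tuning hyperparameters.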

Comments Accepted, IEEE CoG 2026 (IEEE Conference on Games 2026). Replaces previous version "On the Effect of Cheating in Chess"

2601.04876 2026-05-28 cs.SD

ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models

ChronosAudio: 用于评估音频大语言模型的综合长音频基准

Kaiwen Luo, Liang Lin, Yibo Zhang, Moayad Aloqaily, Jialiang Tao, Dexian Wang, Zhenhong Zhou, Junwei Zhang, Kun Wang, Li Sun, Qingsong Wen

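AI summary: Proposes ChronosAudio, the first multi-task benchmark for long-audio understanding in audio large language models, comprising 6 major task categories and 36,000 test instances; experiments reveal long-context collapse and attention dilution, and existing mitigation strategies recover only 50% of performance.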

AI中文摘要

尽管音频大语言模型（ALLMs）取得了显著进展，但其长音频理解能力仍未得到探索。针对通用音频任务，已有大量基准被提出，但它们主要关注短片段，缺乏评估ALLMs在长时间跨度上的共识。本文提出ChronosAudio，这是首个针对ALLMs长音频理解定制的多任务基准。它包含六大任务类别，共36000个测试实例，总时长超过200小时，并按短、中、长三类进行分层，以全面评估长度泛化能力。使用ChronosAudio对16个最先进模型进行的广泛实验得出了三个关键发现：1. 急剧的长上下文崩溃：ALLMs表现出严重的性能维持能力不足，从短上下文过渡到长上下文时，特定任务的性能下降超过90%。2. 结构性注意力稀释：性能下降源于维持时间局部性的根本失败；注意力机制在后续序列中遭受显著扩散。3. 缓解措施的效果上限：当前策略仅能恢复50%的性能。这些发现揭示了长音频中的重大挑战，强调了迫切需要实现稳健的文档级音频推理的方法。

英文摘要

Although Audio Large Language Models (ALLMs) have advanced substantially, their long-audio understanding capabilities remain unexplored. While a plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving no consensus on how to evaluate ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored for long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours of audio, stratified into short-, middle-, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: (1) Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a performance degradation of over 90% in specific tasks. (2) Structural Attention Dilution: Performance degradation stems from a fundamental failure to maintain temporal locality; attention mechanisms suffer from significant diffusion in later sequences. (3) Restorative Ceiling of Mitigation: Current strategies only offer 50% recovery. These findings reveal significant challenges in long-audio understanding, underscoring the urgent need for approaches that achieve robust, document-level audio reasoning.


2601.03549 2026-05-28 cs.CV cs.CL

FEA-SLT: A Gloss-Free End-to-End Framework for Facial-Expression-Aware Sign Language Translation

FEA-SLT：一种面向面部表情感知的手语翻译的无词汇端到端框架

Guobin Tu, Di Weng

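AI summary: Proposes the FEA-SLT framework, whose facial-expression-aware fusion module uses facial dynamics as semantic anchors to resolve manual ambiguity in gloss-free sign language translation, achieving state-of-the-art BLEU performance among gloss-free methods on the PHOENIX14T and CSL-Daily datasets.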

AI中文摘要

手语翻译（SLT）是一项具有挑战性的跨模态任务，需要对手部动作和非手动信号进行联合建模。现有的无词汇SLT方法有效捕捉手势动态，但常常未充分利用面部表情，而面部表情在语法和消除歧义中起着关键作用。当不同概念共享相似手部配置时，这一限制可能导致语义退化。为解决此问题，我们提出FEA-SLT（面部表情感知手语翻译），一种无词汇端到端框架，利用面部动态作为语义锚点来消除手部歧义。FEA-SLT采用领域迁移的面部编码器提取表情敏感表示，并通过语言约束的面部表情感知融合（FEAF）模块将其与手部特征集成。FEAF通过双向调制捕捉手部和面部通道之间的相互依赖关系，增强句法保真度。在PHOENIX14T和CSL-Daily上的实验表明，FEA-SLT在无词汇方法中实现了最先进的BLEU性能，而针对性分析证实了其对面部敏感语句翻译的改进。代码可在[https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT)获取。

英文摘要

Sign Language Translation (SLT) is a challenging cross-modal task requiring joint modeling of manual articulations and non-manual signals. Existing gloss-free SLT methods effectively capture gestural dynamics but often underutilize facial expressions, which play crucial grammatical and disambiguating roles. This limitation can cause semantic degradation when distinct concepts share similar manual configurations. To address this issue, we propose FEA-SLT (**F**acial-**E**xpression-**A**ware **S**ign **L**anguage **T**ranslation), a gloss-free end-to-end framework that uses facial dynamics as semantic anchors for resolving manual ambiguity. FEA-SLT employs a domain-transferred facial encoder to extract expression-sensitive representations and integrates them with manual features through a linguistically constrained *Facial-Expression-Aware Fusion* (FEAF) module. FEAF captures reciprocal dependencies between manual and facial channels via bidirectional modulation, enhancing syntactic fidelity. Experiments on PHOENIX14T and CSL-Daily show that FEA-SLT achieves state-of-the-art BLEU performance among gloss-free methods, while targeted analyses confirm improved translation of facial-sensitive utterances. Code is available at [https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT).
