arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.02174 2026-05-28 cs.LG math.OC math.PR stat.ML

Flatness-Aware Stochastic Gradient Langevin Dynamics

平坦感知随机梯度Langevin动力学

Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim

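AI summary Proposes Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), which couples the noise scale with the inverse temperature as prescribed by theory to bias the dynamics toward flat basins while retaining computational efficiency, and provides a non-asymptotic theoretical analysis with experimental validation.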

Comments Accepted by ICML 2026

Journal ref
ICML 2026
AI Chinese abstract

损失景观的平坦性已被广泛研究,作为理解深度学习算法行为和泛化的重要视角。受此观点启发,我们提出了平坦感知随机梯度Langevin动力学(fSGLD),这是一种一阶优化方法,在保持SGD和SGLD的计算和内存效率的同时,使其动力学偏向平坦盆地。我们提供了非渐近理论分析,表明在理论上规定的噪声尺度$σ$和逆温度$β$之间的耦合下,fSGLD以平坦偏差的吉布斯分布为目标,并给出了显式的过剩风险保证。我们在标准优化器基准、贝叶斯图像分类、不确定性量化和分布外检测上对fSGLD进行了实证评估,展示了持续强劲的性能和可靠的不确定性估计。额外实验证实了理论上规定的$β$-$σ$耦合相对于解耦选择的有效性。

English abstract

Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), a first-order optimization method that biases its dynamics toward flat basins while retaining the computational and memory efficiency of SGD and SGLD. We provide a non-asymptotic theoretical analysis showing that fSGLD targets a flatness-biased Gibbs distribution under a theoretically prescribed coupling between the noise scale $σ$ and the inverse temperature $β$, together with explicit excess risk guarantees. We empirically evaluate fSGLD across standard optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection, demonstrating consistently strong performance and reliable uncertainty estimates. Additional experiments confirm the effectiveness of the theoretically prescribed $β$-$σ$ coupling compared to decoupled choices.
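
As a rough illustration of the kind of update this abstract describes, the Python sketch below performs one Langevin step in which the stochastic gradient is evaluated at a Gaussian-perturbed copy of the parameters. The perturbed-gradient form, the function names, and the toy quadratic loss are assumptions made for illustration. The abstract does not state the fSGLD update rule, and the theoretically prescribed coupling between $σ$ and $β$ is not reproduced here, so both are passed in as independent arguments.

```python
import math
import random


def sgld_step(theta, grad_fn, step_size, beta, sigma=0.0, rng=random):
    """One Langevin update on a list of floats.

    sigma > 0 evaluates the gradient at a Gaussian-perturbed point
    (an assumed stand-in for a flatness-aware gradient, not the paper's rule).
    sigma = 0 gives a plain SGLD step.
    """
    perturbed = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
    grad = grad_fn(perturbed)
    # Injected noise has standard deviation sqrt(2 * step_size / beta).
    noise_scale = math.sqrt(2.0 * step_size / beta)
    return [
        t - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]


def quadratic_grad(theta):
    # Gradient of the toy loss 0.5 * ||theta||^2 (hypothetical example).
    return list(theta)


rng = random.Random(0)
theta = [3.0, -2.0]
for _ in range(2000):
    theta = sgld_step(theta, quadratic_grad, step_size=0.01,
                      beta=1e4, sigma=0.05, rng=rng)
```

With `sigma=0.0` the step reduces to the standard SGLD update: a gradient step plus Gaussian noise whose scale shrinks as the inverse temperature grows.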

2509.23074 2026-05-28 cs.LG cs.AI

Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

超越模型排名:时间序列预测的可预测性对齐评估

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

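AI summary To address benchmark-leaderboard evaluation conflating model performance with the data's intrinsic unpredictability, proposes a predictability-aligned diagnostic framework grounded in spectral coherence, comprising the SCP score and the LUR tool, and reveals predictability drift and an architectural trade-off between models.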

AI Chinese abstract

在时间序列预测的AI模型日益复杂的时代,进展通常通过基准排行榜上的边际改进来衡量。然而,这种方法存在一个根本缺陷:标准评估指标混淆了模型的性能与数据的内在不可预测性。为了解决这一紧迫挑战,我们引入了一个新颖的、基于谱相干的可预测性对齐诊断框架。我们的框架有两个主要贡献:谱相干可预测性(SCP),一个计算高效($O(N\log N)$)且任务对齐的分数,用于量化给定预测实例的固有难度;以及线性利用率(LUR),一个频率分辨的诊断工具,精确测量模型如何有效利用数据中的线性可预测信息。我们验证了框架的有效性,并利用它揭示了两个核心见解。首先,我们提供了“可预测性漂移”的首个系统性证据,表明任务的预测难度随时间剧烈变化。其次,我们的评估揭示了一个关键的架构权衡:复杂模型在低可预测性数据上表现优越,而线性模型在更可预测的任务上非常有效。我们倡导范式转变,超越简单的聚合分数,转向更具洞察力的、可预测性感知的评估,从而促进更公平的模型比较和更深入的模型行为理解。

English abstract

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

2602.01203 2026-05-28 cs.CL cs.LG

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

注意力汇聚在注意力层中锻造原生MoE:针对头部坍塌的汇聚感知训练

Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

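AI summary Shows theoretically and empirically that the attention sink naturally constructs a Mixture-of-Experts mechanism within attention layers, and proposes a sink-aware training algorithm to mitigate head collapse and improve model performance.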

Comments 2026 International Conference on Machine Learning (ICML)

AI Chinese abstract

大型语言模型(LLMs)通常将不成比例的注意力分配给第一个标记,这种现象称为注意力汇聚。最近的几种方法旨在解决这个问题,包括GPT-OSS中的汇聚注意力和Qwen3-Next中的门控注意力。然而,缺乏对这些注意力机制之间关系的全面分析。在这项工作中,我们提供了理论和实证证据,表明普通注意力和汇聚注意力中的汇聚自然地在注意力层内构建了混合专家(MoE)机制。这一见解解释了先前工作中观察到的头部坍塌现象,即只有固定子集的注意力头对生成有贡献。为了缓解头部坍塌,我们提出了一种汇聚感知训练算法,该算法带有专为注意力层设计的辅助负载平衡损失。大量实验表明,我们的方法在普通注意力、汇聚注意力和门控注意力上实现了有效的头部负载平衡,并提高了模型性能。我们希望这项研究能为注意力机制提供新的视角,并鼓励进一步探索注意力层内固有的MoE结构。

English abstract

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
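
To make the idea of a load balancing loss over attention heads concrete, here is a minimal Python sketch. It assumes that the attention mass a head does not place on the sink token acts as that head's gate value, and it uses a simplified penalty (number of heads times the sum of squared load shares). Both choices, and all names, are illustrative assumptions and not the auxiliary loss defined in the paper.

```python
def head_load_balance_loss(sink_attention):
    """sink_attention[h][t]: attention head h places on the sink token at position t.

    Treats (1 - sink attention) as the head's gate value (an assumption),
    averages it per head, and returns H * sum_h share_h^2.
    The result is 1.0 for perfectly uniform loads and H when one head takes all.
    Assumes at least one head has non-zero load.
    """
    loads = [sum(1.0 - a for a in row) / len(row) for row in sink_attention]
    total = sum(loads)
    shares = [load / total for load in loads]
    return len(shares) * sum(s * s for s in shares)


# Two heads, two positions each (hypothetical values).
balanced = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[0.0, 0.0], [1.0, 1.0]]  # second head attends only to the sink

balanced_loss = head_load_balance_loss(balanced)
collapsed_loss = head_load_balance_loss(collapsed)
```

The penalty grows as the load concentrates on fewer heads, which is the behaviour a balancing term is meant to discourage.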

2512.14340 2026-05-28 cs.RO

Field evaluation and optimization of a lightweight autonomous lidar-based UAV system based on a rigorous experimental setup in boreal forest environments

基于严格实验设置的轻量级自主激光雷达无人机系统在北方森林环境中的现场评估与优化

Aleksi Karhunen, Teemu Hakala, Väinö Karjalainen, Eija Honkavaara

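AI summary Proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems, validated with 93 real-world flights of a lightweight lidar-based quadrotor in boreal forests; the optimized system achieved success rates of 12/15 and 15/15 at 1 m/s and 2 m/s in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest.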

Comments This work has been submitted to the IEEE for possible publication

AI Chinese abstract

近年来,利用自主无人机进行林下森林遥感引起了越来越多的兴趣,导致科学文献中发表了大量自主飞行算法。为了支持此类算法的选择和开发,基于已发表研究对现有方法进行可靠比较至关重要。然而,由于实验设置差异很大且报告实践不完整,目前可靠比较面临挑战。本研究提出了一种标准化的实验设置,用于评估自主林下无人机系统,以填补这一空白。所提出的设置强调森林复杂性的定量报告、测试环境的可视化表示、多次重复飞行的执行,以及飞行成功率与定性飞行结果的报告。此外,鼓励在多个目标速度下飞行,并报告实际飞行速度、任务完成时间和点对点飞行距离。该设置通过采用最先进开源算法的轻量级激光雷达四旋翼进行演示,并在两个天然北方森林环境中进行了大量实验评估。基于对原始系统的系统评估,引入了若干改进。随后对优化后的系统重复相同的实验协议,总共进行了93次真实世界飞行。优化后的系统在中难度森林中,目标飞行速度为1 m/s和2 m/s时分别实现了12/15和15/15的成功率,在困难森林中分别为12/15和5/15。采用所提出的实验设置将有助于基于文献的自主林下飞行系统比较,并支持未来基于无人机的森林机器人解决方案的系统性能改进。

English abstract

Interest in utilizing autonomous uncrewed aerial vehicles (UAVs) for under-canopy forest remote sensing has increased in recent years, resulting in the publication of numerous autonomous flight algorithms in the scientific literature. To support the selection and development of such algorithms, a reliable comparison of existing approaches based on published studies is essential. However, reliable comparisons are currently challenging due to widely varying experimental setups and incomplete reporting practices. This study proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems to fill this gap. The proposed setup emphasizes quantitative reporting of forest complexity, visual representation of test environments, execution of multiple repeated flights, and reporting of flight success rates alongside qualitative flight results. In addition, flights at multiple target speeds are encouraged, with reporting of realized flight speed, mission completion time, and point-to-point flight distance. The proposed setup is demonstrated using a lightweight lidar-based quadrotor employing state-of-the-art open-source algorithms, evaluated through extensive experiments in two natural boreal forest environments. Based on a systematic evaluation of the original system, several improvements were introduced. The same experimental protocol was then repeated with the optimized system, resulting in a total of 93 real-world flights. The optimized system achieved success rates of 12/15 and 15/15 at target flight speeds of 1 m/s and 2 m/s, respectively, in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest. Adoption of the proposed experimental setup would facilitate the literature-based comparison of autonomous under-canopy flight systems and support systematic performance improvement of future UAV-based forest robotics solutions.

2601.23262 2026-05-28 cs.LG

Particle-Guided Diffusion Models for Partial Differential Equations

粒子引导的偏微分方程扩散模型

Andrew Millard, Fredrik Lindsten, Zheng Zhao

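AI summary Proposes a particle-guided stochastic sampling method that combines diffusion models with physics-based guidance from PDE residuals and observational constraints; embedded in a Sequential Monte Carlo framework, it yields a scalable generative PDE solver with lower numerical error than existing methods on multiple benchmarks and multiphysics systems.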

AI Chinese abstract

我们引入了一种引导随机采样方法,该方法通过来自偏微分方程残差和观测约束的物理引导来增强扩散模型的采样,确保生成的样本保持物理可行性。我们将此采样过程嵌入到一个新的序贯蒙特卡洛框架中,从而得到一个可扩展的生成式PDE求解器。在多个基准PDE系统以及多物理场和相互作用PDE系统中,我们的方法产生的解场数值误差低于现有最先进的生成方法。

English abstract

We introduce a guided stochastic sampling method that augments sampling from diffusion models with physics-based guidance derived from partial differential equation (PDE) residuals and observational constraints, ensuring generated samples remain physically admissible. We embed this sampling procedure within a new Sequential Monte Carlo (SMC) framework, yielding a scalable generative PDE solver. Across multiple benchmark PDE systems as well as multiphysics and interacting PDE systems, our method produces solution fields with lower numerical error than existing state-of-the-art generative methods.
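
The generic Sequential Monte Carlo ingredient mentioned above can be sketched in a few lines of Python: particles are weighted by how well they satisfy a constraint and then resampled. The exponential weighting, the temperature parameter, and the scalar toy "residual" are assumptions for illustration only. The abstract does not specify the weighting or proposal used in the paper's framework.

```python
import math
import random


def resample_by_residual(particles, residual_fn, temperature, rng):
    """Weight particles by exp(-residual / temperature) and resample multinomially.

    Returns the resampled particles and the normalized weights.
    """
    weights = [math.exp(-residual_fn(p) / temperature) for p in particles]
    total = sum(weights)
    probs = [w / total for w in weights]
    resampled = rng.choices(particles, weights=probs, k=len(particles))
    return resampled, probs


def toy_residual(u):
    # Hypothetical stand-in for a PDE residual: squared violation of u = 1.
    return (u - 1.0) ** 2


particles = [0.0, 0.5, 1.0, 2.0]
resampled, probs = resample_by_residual(
    particles, toy_residual, temperature=0.1, rng=random.Random(0)
)
```

Particles with a small residual receive most of the probability mass, so resampling concentrates the population on candidates that better satisfy the constraint.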

2510.08525 2026-05-28 cs.CL

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

哪些注意力头对推理重要?RL引导的KV缓存压缩

Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

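AI summary Proposes RLKV, which uses reinforcement learning to identify the attention heads critical to reasoning quality, keeps the full KV cache for those heads, and aggressively compresses the rest, achieving 20-60% cache reduction with near-lossless performance.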

AI Chinese abstract

推理型大语言模型通过扩展的思维链生成展现出复杂的推理行为,这些行为在解码过程中对信息损失高度敏感,给KV缓存压缩带来了关键挑战。现有的token丢弃方法通过移除中间步骤直接破坏推理链,而为检索任务设计的头重分配方法无法保留对生成推理至关重要的注意力头。然而,现有方法均无法识别哪些注意力头真正维持推理一致性并控制生成终止。为解决此问题,我们提出RLKV,它使用强化学习作为探针,通过直接优化注意力头缓存使用与实际生成结果的关系,发现哪些头对推理质量有贡献。这一发现自然引出了高效的压缩策略:我们对推理关键的头分配完整KV缓存,同时对其他头使用固定大小的KV缓存进行激进压缩。实验表明,少数头对推理至关重要,使得在多种任务和模型上实现20-60%的缓存减少且性能近乎无损,在60%压缩率下实现高达2.06倍的端到端加速。

English abstract

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20-60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
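
The compression strategy described above (full cache for critical heads, constant-size cache for the rest) implies a simple budget calculation, sketched here in Python. The head counts, sequence length, and compressed cache size are hypothetical numbers chosen for the example, not values reported in the paper.

```python
def kv_cache_reduction(num_heads, critical_heads, seq_len, compressed_len):
    """Fraction of KV entries saved when only critical heads keep the full cache.

    Non-critical heads keep at most `compressed_len` entries each.
    """
    full = num_heads * seq_len
    kept = (critical_heads * seq_len
            + (num_heads - critical_heads) * min(seq_len, compressed_len))
    return 1.0 - kept / full


# Hypothetical configuration: 32 heads, 12 kept at full length.
reduction = kv_cache_reduction(
    num_heads=32, critical_heads=12, seq_len=8192, compressed_len=512
)
```

Because the compressed heads hold a fixed number of entries, the saving grows with sequence length and is bounded above by the share of non-critical heads.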

2507.16679 2026-05-28 cs.CL cs.AI cs.CY

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

PICACO: 通过总相关优化实现大语言模型的多元情境价值对齐

Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

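AI summary To address the Instruction Bottleneck caused by value conflicts in in-context alignment, proposes PICACO, which optimizes a meta-instruction by maximizing the total correlation between the specified values and the model's responses, achieving balanced pluralistic value alignment without fine-tuning.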

Comments ICML 2026

AI Chinese abstract

情境学习在使大语言模型与人类价值对齐方面展现出巨大潜力,有助于减少有害输出并适应多样化偏好,而无需昂贵的后训练,这被称为情境对齐。然而,大语言模型对输入提示的理解仍是不可知的,限制了情境对齐处理价值冲突的能力——人类价值本质上是多元的,常常施加相互冲突的要求,例如刺激与传统。因此,当前的情境对齐方法面临指令瓶颈挑战,即大语言模型难以在单个提示中协调多个预期价值,导致对齐不完整或有偏。为了解决这个问题,我们提出了PICACO,一种新颖的多元情境对齐方法。无需微调,PICACO优化一个融合了多个价值的元指令,以更好地激发大语言模型对这些价值的理解并改进对齐。这是通过最大化指定价值与大语言模型响应之间的总相关来实现的,这从理论上强化了价值一致性并减少了干扰噪声,从而产生更有效的指令。在五个价值集上的大量实验表明,PICACO在黑盒和开源大语言模型上均表现良好,优于多个近期强基线,并在多达8个不同价值之间实现了更好的平衡。

English abstract

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions: human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
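
Total correlation, the quantity maximized above, is the sum of the marginal entropies minus the joint entropy. The Python sketch below computes a plug-in estimate from discrete samples. It illustrates the definition only: how PICACO estimates total correlation between values and LLM responses is not described in the abstract, and the sample data here are made up.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def total_correlation(samples):
    """Plug-in estimate of TC(X_1..X_n) = sum_i H(X_i) - H(X_1..X_n).

    samples: list of equal-length tuples of discrete values.
    """
    joint = entropy(Counter(samples))
    marginals = sum(
        entropy(Counter(s[i] for s in samples)) for i in range(len(samples[0]))
    )
    return marginals - joint


# Three perfectly coupled binary variables versus three independent ones.
dependent = [(0, 0, 0), (1, 1, 1)] * 50
independent = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

tc_dependent = total_correlation(dependent)
tc_independent = total_correlation(independent)
```

Total correlation is zero exactly when the variables are independent and increases as they share more information, which is why it can serve as a measure of how strongly several variables cohere.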

2601.21666 2026-05-28 cs.AI cs.CV

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

SONIC-O1:用于评估多模态大语言模型在音视频理解上的真实世界基准

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

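AI summary Introduces the SONIC-O1 benchmark, 60 hours of human-verified audio-video data for evaluating multimodal large language models on open-ended summarization, multiple-choice question answering, and temporal localization; finds substantial performance gaps and demographic disparities on temporal localization.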

AI Chinese abstract

多模态大语言模型(MLLMs)是近期AI研究的主要焦点。然而,大多数先前工作集中于静态图像理解,而它们处理序列音视频数据的能力仍未充分探索。这一差距凸显了需要一个高质量基准来系统评估MLLM在真实世界场景中的性能。我们介绍了SONIC-O1,一个全面的、完全人工验证的基准,包含60小时(231个片段)跨越13个真实世界对话领域的数据,带有4,958个注释和人口统计元数据。SONIC-O1评估三种能力:开放摘要、多项选择题(MCQ)回答以及带有支持理由(推理)的时序定位。在闭源和开源模型中,我们发现MCQ准确率显示模型家族之间的差距最小,但最好的闭源模型在时序定位上比最好的开源模型高出22.6%。我们进一步观察到不同人口统计组在时序定位上的准确率差距高达21.4%,表明模型行为存在持续差异。SONIC-O1为基于时序和人口统计鲁棒的多模态理解提供了一个开放评估套件。SONIC-O1公开可用于研究:项目页面(https://vectorinstitute.github.io/sonic-o1/)、数据集(https://huggingface.co/datasets/vector-institute/sonic-o1)、GitHub(https://github.com/vectorinstitute/sonic-o1)、排行榜(https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard)。

English abstract

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2601.21167 2026-05-28 cs.LG

Learning What to Recommend: Minimax Optimal Simple Regret in Logistic Bandits

学习推荐什么:逻辑斯蒂老虎机中极小化最优简单遗憾

Shuai Liu, Alireza Bakhtiari, Alex Ayoub, Botao Hao, Csaba Szepesvári

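AI summary For stochastic logistic bandits under the simple-regret objective, proposes two curvature-aware algorithms (MULog and THATS), obtains a regret upper bound that matches the lower bound, and shows that κ_*, the inverse slope of the sigmoid at the optimal action, governs the minimax difficulty.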

AI Chinese abstract

我们研究在简单遗憾目标下具有$d$维动作特征的随机逻辑斯蒂老虎机,其中学习者使用$T$轮探索输出单个最终动作。逻辑斯蒂结构在此至关重要:因为动作的信息量取决于sigmoid的局部曲率,对即时奖励最优的动作不一定对识别最佳最终推荐最有用。我们表明一阶极小化难度由$κ_*$(sigmoid在最优动作处的逆斜率)主导。下界由一个移位饱和困难族实现,其中饱和同时限制了关于最终决策的可用信息,并控制了错误推荐的价值损失。这揭示了一种与累积遗憾构造不同的困难机制,尽管在线到批处理归约在期望上恢复了相同的领先阶。然后我们开发了两种曲率感知算法:MULog,一种纯探索方法,其最终推荐满足阶为$\tilde O(d/\sqrt{κ_* T})$的高概率上界,与下界匹配至对数因子;以及THATS,一种汤普森采样风格的方法,提供了计算上更轻的替代方案。在困难和简单几何上的实验支持相同的图景:信息性低奖励动作可以使实例显著更容易,而曲率感知方法特别有效地利用了这种结构。

English abstract

We study stochastic logistic bandits with $d$-dimensional action features under the simple-regret objective, where a learner uses $T$ rounds of exploration to output a single final action. The logistic structure is essential here: because the informativeness of an action depends on the local curvature of the sigmoid, actions that are best for immediate reward need not be the most useful for identifying the best final recommendation. We show that the first-order minimax difficulty is governed by $κ_*$, the inverse slope of the sigmoid at the optimal action. The lower bound is realized by a shifted saturated hard family in which saturation simultaneously limits the information available about the final decision and controls the value loss from a wrong recommendation. This reveals a hard mechanism distinct from cumulative-regret constructions, even though online-to-batch reductions recover the same leading order in expectation. We then develop two curvature-aware algorithms: MULog, a pure-exploration method whose final recommendation satisfies a high-probability upper bound of order $\tilde O(d/\sqrt{κ_* T})$, matching the lower bound up to logarithmic factors, and THATS, a Thompson-sampling-style method that provides a computationally lighter alternative. Experiments on both hard and easy geometries support the same picture: informative low-reward actions can make instances substantially easier, and the curvature-aware methods exploit this structure especially effectively.
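
The two quantities named in this abstract can be evaluated directly, as in the short Python sketch below. It reads "inverse slope of the sigmoid at the optimal action" as one over the sigmoid's derivative at the optimal action's logit, and evaluates the leading-order term $d/\sqrt{κ_* T}$ with constants and logarithmic factors dropped. The function names and the example numbers are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def kappa_star(logit_at_optimum):
    """Inverse slope of the sigmoid at the given logit: 1 / (mu * (1 - mu))."""
    mu = sigmoid(logit_at_optimum)
    return 1.0 / (mu * (1.0 - mu))


def leading_rate(d, kappa, horizon):
    """Leading-order term d / sqrt(kappa * T), ignoring constants and log factors."""
    return d / math.sqrt(kappa * horizon)


kappa_centered = kappa_star(0.0)   # sigmoid is steepest at a logit of 0
kappa_saturated = kappa_star(4.0)  # flatter region, hence a larger inverse slope
rate = leading_rate(d=10, kappa=kappa_centered, horizon=10_000)
```

The inverse slope is smallest (equal to 4) at a logit of zero and grows as the optimal action moves into the saturated part of the sigmoid.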

2510.11234 2026-05-28 cs.LG

Neural Weight Compression for Language Models

语言模型的神经权重压缩

Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

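AI summary Proposes the Neural Weight Compression (NWC) framework, which trains neural codecs on pretrained weight datasets for efficient compression, addresses tensor heterogeneity and the mismatch between reconstruction loss and downstream performance, and achieves strong accuracy-compression trade-offs in the 4-6 bit regime.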

AI Chinese abstract

随着模型规模和部署的增长,语言模型权重的高效压缩变得越来越关键。然而,现有大多数方法依赖于手工设计的变换和启发式方法,反映出对权重作为数据模态的理解有限。为了超越这一范式,我们将权重压缩公式化为神经编解码器学习,并提出了神经权重压缩(NWC),一个在预训练权重数据集上训练神经编解码器的框架。NWC解决了权重压缩固有的挑战,包括张量异质性和重建损失与下游性能之间的不匹配。实验表明,NWC实现了极具竞争力的精度-压缩权衡,在4-6比特区间内尤其强劲,且不依赖刚性的手工设计组件(如Hadamard变换)。这些优势扩展到不同架构,例如视觉编码器。我们的分析强调了熵约束量化和学习变换在使压缩适应权重数据和下游任务中的作用。

English abstract

Efficient compression of language model weights is increasingly critical as model scale and deployment grow. Yet, most existing methods rely on handcrafted transforms and heuristics, reflecting the limited understanding of weights as a data modality. To move beyond this paradigm, we formulate weight compression as neural codec learning and propose Neural Weight Compression (NWC), a framework for training neural codecs on pretrained weight datasets. NWC addresses challenges intrinsic to weight compression, including tensor heterogeneity and the mismatch between reconstruction losses and downstream performance. Experiments show that NWC achieves highly competitive accuracy-compression tradeoffs, with particularly strong results in the 4-6 bit regime, without relying on rigid handcrafted components such as the Hadamard transform. These gains extend across diverse architectures, e.g., vision encoders. Our analysis highlights the roles of entropy-constrained quantization and learned transforms in adapting compression to weight data and downstream tasks.

2601.19926 2026-05-28 cs.CL cs.AI

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Transformer的语法:语言模型中句法知识可解释性研究的系统综述

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

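AI summary A systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs); finds that TLMs encode non-trivial syntactic knowledge but perform worse on syntax-semantics interface phenomena, and that research is concentrated on English and BERT-like models.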

AI Chinese abstract

我们对337篇评估基于Transformer的语言模型(TLM)句法能力的文章进行了系统综述,报告了涵盖广泛句法现象、语言、模型和方法的3000多个数据点。这些数据共同表明,TLM编码了非平凡的句法知识。行为证据显示,TLM在形式句法现象上表现强劲,但在句法-语义接口现象上表现较弱且多变。对于数字支持较少的语言,表现也持续较低。探针和机制研究进一步支持TLM中存在句法知识。然而,由于大多数工作仍停留在观察层面,且当前方法在方法论上具有异质性,对句法处理背后的详细计算机制的洞察仍然有限。同时,文献仍然高度集中在英语和BERT类模型上。我们讨论了研究结果的意义,并为未来研究提供了建议。

English abstract

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2601.08131 2026-05-28 cs.CL

Attention Projection Mixing with Exogenous Anchors

基于外生锚点的注意力投影混合

Jonathan Su

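AI summary To address the structural conflict in internal-anchor designs for cross-layer reuse of early attention projections, proposes ExoFormer, which learns exogenous anchor projections outside the sequential layer stack and introduces a unified normalized mixing framework, improving downstream accuracy while using fewer tokens.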

AI Chinese abstract

早期注意力投影的跨层重用可以改善优化和数据效率,但它造成了一个结构冲突:第一层必须同时作为所有更深层的稳定、可重用的锚点和有效的计算块。我们证明这种张力限制了内部锚点设计的性能。我们提出ExoFormer,通过在序列层堆栈之外学习外生锚点投影来解决这一冲突。我们引入了一个统一的归一化混合框架,该框架使用可学习的系数(探索系数粒度:元素级、头级和标量级)混合查询、键、值和门控对数,并表明归一化锚点源是稳定重用的关键。ExoFormer变体始终优于其内部锚点对应物,动态变体在匹配验证损失的情况下,使用比Gated Attention少1.5倍的令牌,获得1.5倍的下游准确率。我们通过卸载假说解释这种有效性:外部锚点保留必要的令牌身份,使层能够专门专注于特征变换。我们发布代码和模型以促进未来研究。

English abstract

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Mask-GCG:对抗性后缀中的所有标记对于越狱攻击都是必要的吗?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

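AI summary Proposes Mask-GCG, which uses learnable token masking to identify high-impact tokens in the suffix and prune low-impact ones, lowering computational overhead while maintaining the attack success rate and revealing token redundancy in LLM prompts.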

Comments Accepted to ICASSP 2026

Journal ref
2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026
AI Chinese abstract

针对大型语言模型(LLM)的越狱攻击已展示了多种成功方法,攻击者操纵模型生成其本应避免的有害响应。其中,贪婪坐标梯度(GCG)作为一种通用且有效的方法,通过优化后缀中的标记来生成可越狱的提示。尽管已提出多种GCG的改进变体,但它们都依赖于固定长度的后缀。然而,这些后缀中潜在的冗余尚未被探索。在这项工作中,我们提出Mask-GCG,一种即插即用的方法,采用可学习的标记掩码来识别后缀中的高影响力标记。我们的方法增加了高影响力位置标记的更新概率,同时剪枝低影响力位置的标记。这种剪枝不仅减少了冗余,还降低了梯度空间的大小,从而减少了计算开销,并缩短了实现成功攻击所需的时间。我们将Mask-GCG应用于原始GCG及其多种改进变体进行评估。实验结果表明,后缀中的大多数标记对攻击成功有显著贡献,剪枝少数低影响力标记不会影响损失值或攻击成功率(ASR),从而揭示了LLM提示中的标记冗余。我们的发现从越狱攻击的角度为开发高效且可解释的LLM提供了见解。

English abstract

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.17737 2026-05-28 cs.CV cs.AI

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

脚本即一切:一个用于长程对话到电影视频生成的智能体框架

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

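AI summary Proposes an end-to-end agentic framework that trains ScripterAgent to turn dialogue into fine-grained scripts and uses DirectorAgent's cross-scene continuous generation strategy to produce coherent long-horizon dialogue-to-cinematic videos, significantly improving script faithfulness and temporal fidelity.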

AI Chinese abstract

近期视频生成的进展产生了能够从简单文本提示合成惊艳视觉内容的模型。然而,这些模型难以从对话等高层概念生成连贯的长篇叙事,揭示了创意想法与其电影执行之间的“语义鸿沟”。为弥合这一鸿沟,我们引入了一个新颖的、端到端的智能体框架,用于对话到电影视频的生成。我们框架的核心是ScripterAgent,一个经过训练将粗略对话转化为精细、可执行的电影脚本的模型。为此,我们构建了ScriptBench,一个具有丰富多模态上下文的新大规模基准,通过专家引导的流程进行标注。生成的脚本随后指导DirectorAgent,它使用跨场景连续生成策略协调最先进的视频模型,以确保长程连贯性。我们的全面评估,包括一个AI驱动的CriticAgent和一个新的视觉-脚本对齐(VSA)指标,表明我们的框架在所有测试的视频模型上显著提高了脚本忠实度和时间保真度。此外,我们的分析揭示了当前SOTA模型在视觉奇观与严格脚本遵循之间的关键权衡,为自动化电影制作的未来提供了宝贵见解。

English abstract

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2601.18116 2026-05-28 cs.CL

BEAR: Budgeted Evidence Allocation for Multi-Document Reasoning

BEAR: 面向多文档推理的预算化证据分配

Lin Sun, Linglin Zhang, Jingang Huang, Change Jia, Zhengwei Cheng, Xiangzheng Zhang

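AI summary Proposes the BEAR framework, which builds hierarchical semantic indices and performs coarse-to-fine evidence access at query time, enabling efficient multi-document reasoning under a fixed evidence budget.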

AI Chinese abstract

我们认为多文档推理不仅受限于模型能读取的文本量,还受限于有限的查询时证据预算如何在文档和语义粒度之间分配。全上下文推理非选择性地向模型提供广泛证据且每次查询成本高,而平面分块检索通常返回局部相关但跨文档综合组织薄弱的段落。我们提出BEAR,一个结构化证据分配框架,它离线构建分层语义索引,并在查询时通过互补的探索和恢复路径进行由粗到细的证据访问。这种由粗到细的设计可视为在固定证据上下文预算下的结构化证据分配。在合成和真实基准上,BEAR在DragonBall上表现尤为强劲,在HotpotQA上与强检索基线保持竞争力,并在我们的评估协议下在2Wiki上取得了最佳的基于检索的结果,同时其查询时证据预算远小于所报告的长上下文参考。进一步分析表明,性能提升与作为分配基础的分层结构以及互补的探索和恢复相关,而非仅靠语义分块。

English abstract

We argue that multi-document reasoning is constrained not only by how much text a model can read, but also by how a limited query-time evidence budget is allocated across documents and semantic granularities. Full-context inference exposes the model to broad evidence non-selectively and at high per-query cost, while flat chunk retrieval often returns locally relevant passages that are weakly organized for cross-document synthesis. We present BEAR, a framework for structured evidence allocation that builds hierarchical semantic indices offline and performs coarse-to-fine evidence access at query time through complementary exploration and recovery paths. This coarse-to-fine design can be viewed as structured evidence allocation under a fixed evidence-context budget. Across synthetic and real-world benchmarks, BEAR performs particularly strongly on DragonBall, remains competitive with strong retrieval-based baselines on HotpotQA, and yields the best retrieval-based result on 2Wiki under our evaluated protocol, while operating under substantially smaller query-time evidence budgets than the reported long-context references. Additional analyses suggest that the gains are associated with hierarchy as an allocation substrate together with complementary exploration and recovery, rather than semantic chunking alone.

2404.06106 2026-05-28 cs.LG

Unifying Low Dimensional Spectra in Deep Learning

统一深度学习中的低维谱

Connall Garrod, Jonathan P. Keating

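AI summary Uses unconstrained feature models (UFMs) to show that deep neural collapse (DNC) is the unifying source of low-dimensional spectral structure in several deep learning matrices (such as the Hessian, gradients, and weights), and gives analytic constructions of the eigenvalues and eigenvectors.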

Comments revised version; title changed slightly. 45 pages, 20 figures. Accepted at the International Conference on Machine Learning 2026

AI Chinese abstract

在过参数化分类网络中,深度学习矩阵的特征谱中普遍出现低维结构。尽管理论进展旨在解释这一现象,但通常只能捕捉部分行为或依赖实践中不成立的假设。本文为几种典型的深度学习矩阵(包括Hessian、梯度和权重)的体加离群结构提供了解析解释。我们使用无约束特征模型(UFMs)——一种研究深度神经坍缩(DNC)出现的常用工具——来实现这一点。我们证明DNC是这些低维特征谱的根源,每种情况下,特征值和特征向量都可以从特征均值(DNC的表征对象)构造出来。这为深度学习中的广泛谱现象提供了统一的解析解释,并通过提供特征向量的详细分析,超越了通常仅关注特征值的经验刻画。我们证明结果对线性网络和ReLU网络均成立,并在建模语境和标准数据集上的标准深度网络架构中提供了数值验证。

English abstract

Low dimensional structures appear ubiquitously in the eigenspectra of deep learning matrices in classification networks trained in the overparameterized regime. While theoretical advances have aimed to explain this phenomenology, they typically succeed only in capturing subsets of the full behavior or rely on assumptions that cannot hold in practice. In this work, we provide an analytic explanation for the bulk plus outlier structure of several canonical deep learning matrices, including the Hessian, gradients, and weights. We achieve this using unconstrained feature models (UFMs), a now-common tool for studying the emergence of deep neural collapse (DNC). We show that DNC is the source of these low dimensional eigenspectra: in each case, the eigenvalues and eigenvectors can be constructed from feature means, the characterizing objects of DNC. This provides a unifying analytic explanation for a wide range of spectral phenomena in deep learning and goes beyond empirical characterizations, which typically focus on eigenvalues, by providing a detailed analysis of eigenvectors. We prove that our results hold for both linear and ReLU networks and provide numerical validation in both the modeling context and standard deep-network architectures on canonical datasets.

2601.18006 2026-05-28 cs.CL

PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation

PEAR:机器翻译中自动相对评分的成对评估

Lorenzo Proietti, Roman Grundkiewicz, Matt Post

AI总结 提出PEAR,一种监督式质量估计指标族,通过成对比较实现无参考机器翻译评估,预测质量差异方向和幅度,在WMT24基准上优于单候选基线,并有效用于最小贝叶斯风险解码。

Comments ACL 2026 Main Conference. 19 pages

AI中文摘要

我们提出PEAR(成对评估用于自动相对评分),一种监督式质量估计(QE)指标族,将无参考机器翻译(MT)评估重新定义为分级成对比较。给定一个源片段和两个候选翻译,PEAR预测它们质量差异的方向和幅度。这些指标使用从人工判断差异中导出的成对监督进行训练,并添加一个正则化项,鼓励在候选顺序反转时符号反转。在WMT24元评估基准上,PEAR优于使用相同数据和骨干网络训练的严格匹配的单候选QE基线,隔离了所提出的成对公式的优势。尽管使用的参数远少于近期的大指标,PEAR超越了更大的QE模型和基于参考的指标。我们的分析进一步表明,与其他顶级指标相比,PEAR产生更少冗余的评估信号。最后,我们展示PEAR是用于最小贝叶斯风险(MBR)解码的有效效用函数,以可忽略的影响降低了成对评分成本。

英文摘要

We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised quality estimation (QE) metric family that reframes reference-free machine translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for minimum Bayes risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.
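
The order-reversal regularizer mentioned above can be illustrated with a minimal sketch. The abstract does not give PEAR's actual loss, so the squared-error form, the weight `lam`, and the toy scorer are all assumptions for illustration only.

```python
def pairwise_loss(score_fn, src, cand_a, cand_b, target_diff, lam=0.5):
    """Regression loss on a quality difference plus an antisymmetry penalty.

    score_fn(src, a, b) is a hypothetical model returning a signed score,
    positive when `a` is judged better than `b`.
    """
    forward = score_fn(src, cand_a, cand_b)
    backward = score_fn(src, cand_b, cand_a)
    # Fit the direction and magnitude of the human-judged difference.
    fit = (forward - target_diff) ** 2
    # Penalize any departure from f(a, b) == -f(b, a).
    antisym = (forward + backward) ** 2
    return fit + lam * antisym


def toy_scorer(src, a, b):
    # Hypothetical stand-in: the longer candidate wins, with an order bias of 0.25.
    return (len(a) - len(b)) + 0.25


loss = pairwise_loss(toy_scorer, "src", "abcd", "ab", target_diff=2.0)
```

Here the toy scorer's order bias makes the forward and backward scores sum to 0.5 rather than 0, and the penalty term is what would push a trained model toward sign inversion.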

2508.14082 2026-05-28 cs.LG

Toward Robust Semi-supervised Regression via Dual-stream Knowledge Distillation

通过双流知识蒸馏实现鲁棒半监督回归

Ye Su, Hezhe Qiao, Wei Huang, Lin Chen

AI总结 针对半监督回归中未标记数据利用不足和伪标签噪声问题,提出双流知识蒸馏框架(DKD),通过蒸馏连续值知识和分布信息,并结合解耦分布对齐模块,提升回归预测的鲁棒性和样本效率。

Comments 12 pages

AI中文摘要

半监督回归(SSR)旨在预测样本的连续分数,同时减少对大规模标记数据的依赖,近年来在计算机视觉、自然语言处理、音频分析和医学分析等各种应用中引起了广泛关注。现有的SSR方法通常通过引入基于约束的正则化或序数排序来使用稀缺的标记数据训练模型,以减轻过拟合。然而,这些方法往往未能充分利用丰富的未标记样本。尽管一致性驱动的伪标签方法试图纳入未标记数据,但其性能对伪标签质量和噪声预测高度敏感。为了解决这些挑战,我们提出了一个双流知识蒸馏框架(DKD),专门为SSR设计,用于蒸馏连续值知识和分布信息。这种设计更好地保留了回归幅度信息并提高了样本效率。具体来说,在DKD中,教师模型仅使用真实标签进行优化以进行标签分布估计,而学生模型则从真实标签和教师生成的未标记数据伪目标中学习。蒸馏过程实现了有效的监督转移,使学生能够更鲁棒地利用伪标签。此外,我们引入了一个解耦分布对齐(DDA)模块,该模块分别对齐教师和学生之间的目标分布和非目标分布。为了提高非目标知识转移的可靠性,DDA包含一个方差引导的非目标分布对齐策略,该策略自适应地降低不确定的教师预测的权重,从而增强学生减轻伪标签监督中噪声的能力,并学习一个更好校准的回归预测器。

英文摘要

Semi-supervised regression (SSR), which aims to predict continuous scores for samples while reducing the reliance on large-scale labeled data, has recently attracted considerable attention across various applications, including computer vision, natural language processing, audio analysis, and medical analysis. Existing SSR methods typically train models with scarce labeled data by introducing constraint-based regularization or ordinal ranking to mitigate overfitting. However, these approaches often fail to fully exploit the abundance of unlabeled samples. Although consistency-driven pseudo-labeling methods attempt to incorporate unlabeled data, their performance is highly sensitive to pseudo-label quality and noisy predictions. To address these challenges, we propose a Dual-stream Knowledge Distillation framework (DKD), which is specifically designed for SSR to distill both continuous-valued knowledge and distributional information. This design better preserves regression magnitude information and improves sample efficiency. Specifically, in DKD, the teacher is optimized solely with ground-truth labels for label distribution estimation, while the student learns from a mixture of real labels and teacher-generated pseudo targets on unlabeled data. The distillation process enables effective supervision transfer, allowing the student to leverage pseudo labels more robustly. Furthermore, we introduce a Decoupled Distribution Alignment (DDA) module, which separately aligns the target and non-target distributions between the teacher and student. To improve the reliability of non-target knowledge transfer, DDA incorporates a variance-guided non-target distribution alignment strategy that adaptively downweights uncertain teacher predictions, thereby enhancing the student's ability to mitigate noise in pseudo-label supervision and learn a better-calibrated regression predictor.

2601.15015 2026-05-28 cs.LG

Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control

强化学习算法在大规模流动控制中的即插即用基准测试

Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz

AI总结 提出首个完全基于PyTorch、可微分的强化学习流动控制基准套件FluidGym,通过标准化评估协议实现控制方法的系统比较。

Comments Accepted to ICML 2026. Code available at https://github.com/safe-autonomous-systems/fluidgym

AI中文摘要

强化学习(RL)在主动流动控制(AFC)中显示出有希望的结果,但由于现有研究依赖于异构的观测和驱动方案、数值设置和评估协议,该领域的进展仍然难以评估。当前的AFC基准试图解决这些问题,但严重依赖外部计算流体动力学(CFD)求解器,不是完全可微分的,并且对3D和多智能体的支持有限。为了克服这些限制,我们引入了FluidGym,这是第一个独立的、完全可微分的AFC中RL基准套件。FluidGym完全在PyTorch中构建,基于GPU加速的PICT求解器,在单个Python堆栈中运行,不需要外部CFD软件,并提供标准化的评估协议。我们展示了使用PPO、SAC、DPC和TD-MPC的基线结果,并将所有环境、数据集和训练模型作为公共资源发布。FluidGym能够系统比较控制方法,为基于学习的流动控制的未来研究建立可扩展的基础,并可在github.com/safe-autonomous-systems/fluidgym获取。

英文摘要

Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi-agent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU-accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO, SAC, DPC, and TD-MPC, and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning-based flow control, and is available at github.com/safe-autonomous-systems/fluidgym.

2505.17654 2026-05-28 cs.CL cs.AI

EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

EVADE-Bench:用于评估和增强规避性内容检测的多模态基准

Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Yukun Chen, Hamid Alinejad-Rokny, Min Yang

AI总结 针对电商平台中LLM/VLM易受规避性内容攻击的问题,提出首个专家标注的中文多模态基准EVADE-Bench,评估26个模型并发现规则分类可提升检测一致性,多智能体分解策略能显著提高准确率。

Comments SIGIR 2026

AI中文摘要

电商平台越来越依赖大型语言模型(LLMs)和视觉语言模型(VLMs)来检测非法或误导性产品内容。然而,这些模型仍然容易受到规避性内容的影响,即通过分词、委婉语言或图像裁剪等技术故意修改的输入,以掩盖违反政策的行为,同时仍传达被禁止的主张。关键在于,检测此类内容需要模型同时掌握两种能力:准确理解复杂规则,以及正确推断故意混淆的多模态输入背后的真实意图。虽然先前的工作分别探索了LLM对复杂规则的推理和基于LLM的规避性内容检测,但现有基准尚未将两者结合在统一的评估框架内。这一差距在电商领域尤为严重,因为准确的审核要求这两种能力协同运作。为填补这一空白,我们引入了EVADE-Bench,这是首个专家策划的中文多模态基准,专门设计用于评估LLMs和VLMs在真实电商场景中的规避性内容检测。我们对26个开源和闭源LLMs及VLMs的全面评估显示,即使是最先进的模型也经常错误分类规避性样本。我们进一步证明,更清晰的规则分类显著提高了模型预测的一致性并减少了错误预测,凸显了基准设计在实现可靠评估中的关键作用。为了探索性能提升的路径,我们研究了多智能体分解在多模态推理中的可行性,即将视觉描述和逻辑推理解耦为独立的智能体,并发现这一策略带来了显著的准确率提升。

英文摘要

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content, which refers to inputs that have been deliberately modified through techniques such as word splitting, euphemistic language, or image cropping to conceal policy violations while still conveying prohibited claims. Crucially, detecting such content requires a model to simultaneously master two capabilities: accurately comprehending complex rules, and correctly inferring the true intent behind deliberately obfuscated multimodal inputs. While prior work has separately explored LLM reasoning over complex rules and LLM-based detection of evasive content, no existing benchmark combines both within a unified evaluation framework. This gap is particularly consequential in e-commerce, where accurate moderation demands that both capabilities operate in concert. To address this gap, we introduce EVADE-Bench, the first expert-curated Chinese multimodal benchmark specifically designed to evaluate LLMs and VLMs on evasive content detection in real-world e-commerce scenarios. Our comprehensive evaluation of 26 open- and closed-source LLMs and VLMs reveals that even state-of-the-art models frequently misclassify evasive samples. We further demonstrate that clearer rule categorization significantly improves model prediction consistency and reduces false predictions, highlighting the critical role of benchmark design in enabling reliable evaluation. To explore paths for performance improvement, we investigate the feasibility of multi-agent decomposition for multimodal reasoning, wherein visual description and logical inference are decoupled into separate agents, and find that this strategy yields notable accuracy gains.

2601.12154 2026-05-28 cs.CL

Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

基于嵌入的主题建模和LLM分析癌症患者的体验

Teodor-Călin Ionescu, Lifeng Han, Jan Heijdra Suasnabar, Anne Stiggelbout, Suzan Verberne

AI总结 本研究利用BERTopic和Top2Vec等神经主题建模方法,结合LLM(GPT4)进行主题标注,从癌症患者访谈数据中提取有意义主题,并评估不同嵌入模型的效果,发现领域特定的BioClinicalBERT嵌入能提高主题精度和可解释性。

Comments accepted by the CLIN journal. The CLIN Journal is the journal for research in computational linguistics in The Netherlands and Belgium

AI中文摘要

本研究探讨了使用神经主题建模和LLM从患者叙述数据中发现有意义主题的方法,以提供有助于更以患者为中心的医疗实践的见解。我们分析了一组转录的癌症患者访谈(13次访谈,共132,722词)。首先,我们通过使用相似的预处理、分块和聚类配置,评估BERTopic和Top2Vec在单个访谈摘要中的关键词提取性能,以确保公平比较。然后,使用LLM(GPT4)进行下一步的主题标注。通过小规模人工评估,对单个访谈(I0)的输出进行评分,重点关注连贯性、清晰度和相关性。基于初步结果和评估,BERTopic表现出更强的性能,并被选用于进一步实验,使用三种临床导向的嵌入模型。然后,我们使用最佳模型设置分析了完整的访谈集合。结果表明,领域特定的嵌入提高了主题的精确度和可解释性,其中BioClinicalBERT在转录中产生最一致的结果。使用BioClinicalBERT嵌入模型对全部13次访谈的全局分析揭示了所有13次访谈中最主要的主题,即“癌症护理管理中的协调与沟通”和“患者癌症治疗旅程中的决策”。尽管这些访谈是从荷兰语机器翻译成英语,且临床专业人员未参与评估,但研究结果表明,神经主题建模,特别是BERTopic,可以帮助从患者访谈中为临床医生提供有用的反馈。该流程可以支持更高效的文档导航,并加强患者在医疗工作流程中的声音。

英文摘要

This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, to offer insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization, using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on keyword extraction. An LLM (GPT4) is then used for the next step, topic labeling. The outputs for a single interview (I0) are rated through a small-scale human evaluation, focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three clinically oriented embedding models. We then analyze the full interview collection with the best model setting. Results show that domain-specific embeddings improve topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.

2601.10714 2026-05-28 cs.CV cs.GR

Alterbute: Editing Intrinsic Attributes of Objects in Images

Alterbute: 编辑图像中物体的内在属性

Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen

AI总结 提出Alterbute方法,通过扩散模型结合松弛训练目标和视觉命名实体,在保持物体身份和场景上下文的同时编辑颜色、纹理、材质和形状等内在属性。

Comments ICML 2026. Project page is available at https://talreiss.github.io/alterbute/

AI中文摘要

我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中物体的内在属性。我们允许改变物体的颜色、纹理、材质甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖无监督先验,往往无法保持身份,要么使用过度严格的监督,阻止有意义的内部变化。我们的方法依赖于:(i) 一个松弛的训练目标,允许模型在身份参考图像、描述目标内在属性的文本提示以及定义外在上下文的背景图像和物体掩码的条件下,改变内在和外在属性。在推理时,我们通过重用原始背景和物体掩码来限制外在变化,从而确保只改变所需的内在属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如“保时捷911 Carrera”),这些类别将共享身份定义特征的物体分组,同时允许内在属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和内在属性描述,从而实现可扩展的、保持身份的监督。Alterbute在保持身份的物体内在属性编辑方面优于现有方法。

英文摘要

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs), fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

2601.10334 2026-05-28 cs.CV cs.LG

An analytic theory of convolutional neural network inverse problems solvers

卷积神经网络逆问题求解器的解析理论

Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss

AI总结 通过最小均方误差估计器引入平移等变性和有限感受野的归纳偏置,推导出局部等变MMSE的解析公式,并在多种逆问题、数据集和架构上验证其与神经网络输出高度一致。

Journal ref
Forty-Third International Conference on Machine Learning, 2026
AI中文摘要

监督卷积神经网络(CNN)被广泛用于解决成像逆问题,在众多应用中取得了最先进的性能。然而,尽管取得了经验上的成功,这些方法从理论角度仍缺乏理解,常被视为黑箱。为弥合这一差距,我们通过最小均方误差(MMSE)估计器的视角分析训练后的神经网络,并引入捕获CNN两个基本归纳偏置(平移等变性和通过有限感受野的局部性)的功能约束。在经验训练分布下,我们推导出这种约束变体(称为局部等变MMSE,LE-MMSE)的解析、可解释且易于计算的公式。通过在不同逆问题(去噪、修复、去卷积)、数据集(FFHQ、CIFAR-10、FashionMNIST)和架构(U-Net、ResNet、PatchMLP)上的大量数值实验,我们证明了我们的理论与神经网络输出相匹配(PSNR $\gtrsim25$dB)。此外,我们提供了对物理感知和物理无关估计器之间差异、训练(补丁)分布中高密度区域的影响以及其他因素(数据集大小、补丁大小等)影响的见解。

英文摘要

Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim25$dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
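
For intuition about what an MMSE estimator "under the empirical training distribution" looks like, the sketch below computes the unconstrained estimator for Gaussian denoising, which is a posterior-weighted average of the training samples. It is only the textbook baseline: it does not implement the paper's LE-MMSE, which adds equivariance and locality constraints, and the function and variable names are hypothetical.

```python
import math


def empirical_mmse_denoise(y, train, sigma):
    """E[x | y] for y = x + N(0, sigma^2 I), with x uniform over `train`."""
    # Log-weight of each training sample: -||y - x_i||^2 / (2 sigma^2).
    log_weights = [
        -sum((yj - xj) ** 2 for yj, xj in zip(y, x)) / (2 * sigma ** 2)
        for x in train
    ]
    peak = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - peak) for lw in log_weights]
    total = sum(weights)
    # Weighted average of the training samples, coordinate by coordinate.
    return [
        sum(w * x[j] for w, x in zip(weights, train)) / total
        for j in range(len(y))
    ]


train = [[0.0, 0.0], [1.0, 1.0]]
midpoint_estimate = empirical_mmse_denoise([0.5, 0.5], train, sigma=1.0)
near_zero_estimate = empirical_mmse_denoise([0.1, 0.1], train, sigma=0.1)
```

An observation equidistant from both samples is mapped to their mean, while at low noise the estimate snaps to the nearest training sample.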

2601.10085 2026-05-28 cs.CL

CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

CALM-IT: 通过双角色对话动态追踪生成逼真的长形式动机访谈对话

Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Michael L. Birnbaum, Munmun De Choudhury

AI总结 提出CALM-IT框架,通过显式建模客户与咨询师状态的演变来生成和评估长形式动机访谈对话,在8,232个合成对话语料上优于基线方法,尤其在MITI 4.2全局评分和客户接受率上表现最佳。

Comments 53 pages, in submission to EMNLP

AI中文摘要

治疗性对话并非孤立响应的序列:客户目标、动机、抵抗和治疗联盟随时间演变。然而,当前基于LLM的心理健康对话系统通常缺乏在长时间交互中追踪这些动态的显式机制,可能导致时机不当的干预或过早的目标解决。我们引入CALM-IT,一个通过显式建模客户和咨询师状态演变来生成和评估长形式动机访谈对话的框架,指导咨询策略选择和话语生成。我们在包含8,232个合成对话的大规模语料库上评估CALM-IT,涵盖多种对话长度和框架。与所有基线相比,CALM-IT在大多数MITI 4.2全局评分(包括共情、伙伴关系和软化维持谈话)以及其他关键性能指标上取得最佳性能,且随着对话长度增加性能下降最小。值得注意的是,尽管CALM-IT发起的改变导向提示较少,但在不同长度条件下平均客户接受率最高(64.3%)。我们发布了一个可复现的生成框架、一个基于MITI的过程级评估协议,以及一个大规模合成语料库,用于在逼真的长形式交互条件下研究治疗性LLM。

英文摘要

Therapeutic dialogue is not a sequence of isolated responses: client goals, motivation, resistance, and therapeutic alliance evolve over time. Yet current LLM-based mental health dialogue systems often lack explicit mechanisms for tracking these dynamics across extended interactions, which can lead to poorly timed interventions or premature goal resolution. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing dialogues through explicit modeling of evolving client and counselor states, guiding both counseling strategy selection and utterance generation. We evaluate CALM-IT on a large-scale corpus of 8,232 synthetic dialogues spanning multiple dialogue lengths and frameworks. Compared with all baselines, CALM-IT achieves the best performance on most MITI 4.2 global ratings, including Empathy, Partnership, and Softening Sustain Talk, as well as on other key performance metrics while exhibiting minimal performance degradation as dialogue length increases. Notably, although CALM-IT initiates fewer change-directed prompts, it produces the highest client acceptance rate (64.3%) on average across different length conditions. We release a reproducible generation framework, a MITI-grounded process-level evaluation protocol, and a large-scale synthetic corpus for studying therapeutic LLMs under realistic long-form interaction conditions.

2601.08617 2026-05-28 cs.CV

SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

SoC: 测试时提示调优的语义正交校准

Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz

AI总结 针对视觉语言模型测试时提示调优中校准被忽视的问题,提出基于Huber的正则化方法SoC,在保持语义邻近性的同时实现平滑的原型分离,从而改善校准性能并保持判别能力。

Journal ref
CVPR 2026
AI中文摘要

随着视觉语言模型(VLM)在医疗或自动驾驶等关键决策系统中的日益普及,对其不确定性估计的校准变得至关重要。然而,这一维度在VLM测试时提示调优(TPT)文献中尚未得到充分探索,该领域主要侧重于提升其判别性能。最近的最先进方法主张对文本提示嵌入对实施完全正交约束以增强可分离性,从而改善校准。然而,正如我们在理论上所示,完全正交约束的固有梯度会强烈地将语义相关的类别推开,最终使模型过度自信。基于我们的发现,我们提出了语义正交校准(SoC),一种基于Huber的正则化器,它在保持语义邻近性的同时实现平滑的原型分离,从而相比于先前的基于正交性的方法改善了校准。通过全面的实证验证,我们证明SoC在保持竞争性判别能力的同时,持续改善了校准性能。

英文摘要

With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
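
To make the idea of a Huber-based separation penalty concrete, here is a minimal sketch that applies the standard Huber function to pairwise cosine similarities between class prototypes. The abstract does not specify SoC's exact formulation, so the choice of cosine similarity, the threshold `delta`, and all names below are assumptions. The point is only that the Huber function's slope is capped at `delta`, whereas a squared penalty's slope keeps growing with similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def huber(r, delta):
    """Quadratic for |r| <= delta, linear beyond it (slope capped at delta)."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def separation_penalty(prototypes, delta=0.3):
    """Mean Huber penalty over all distinct prototype pairs."""
    total, pairs = 0.0, 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += huber(cosine(prototypes[i], prototypes[j]), delta)
            pairs += 1
    return total / pairs


protos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
penalty = separation_penalty(protos)
```

With a squared penalty, the highly similar pairs would dominate the gradient; under the Huber form their pull is bounded, which is one way to read "smooth prototype separation while preserving semantic proximity".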

2601.07648 2026-05-28 cs.CL

What Are We Measuring in NLG? A Meta-Analysis of Evaluation Trends 2020-2025

我们在NLG中测量什么?2020-2025年评估趋势的元分析

Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck, David Schlangen, Sebastian Möller, Vera Schmitt

AI总结 通过元分析14,171篇论文,揭示NLG评估中的三个系统性问题:度量惯性、度量-标准映射问题和验证差距,并提出最小评估清单。

Comments 8 pages

AI中文摘要

随着自然语言生成(NLG)主导现代NLP,可扩展评估仍然是一个关键瓶颈。因此,LLM作为评判者(LaaJ)的采用迅速加速,在2025年出现在比人工评估更多的论文中。这一关键转变促使对当前评估实践进行批判性分析。克服刚性关键词过滤和人工审查的限制,我们采用多LLM信息提取管道,从四大NLP会议(2020-2025)的14,171篇论文中收集结构化元数据。分析3,334篇过滤后的NLG论文,我们识别出三个系统性问题。(1)度量惯性:尽管转向开放式生成,传统词汇度量(BLEU、ROUGE)仍作为主要指标持续存在,通常与语义替代方案一起使用而非被取代。(2)度量-标准映射问题:我们的论文级共现数据显示,通用自动度量被用作质量的广泛代理,而未指定它们旨在评估文本生成的哪个维度。(3)验证差距:LaaJ在没有相应人工验证的情况下快速增长(少于8%的论文)。关键的是,虽然LaaJ与整体质量相关,但在流畅性等细粒度标准上一致性崩溃。为弥补这些差距,我们将发现提炼为一个最小评估清单,以指导度量选择、构念效度和LaaJ部署。

英文摘要

As Natural Language Generation (NLG) dominates modern NLP, scalable evaluation remains a critical bottleneck. Consequently, LLM-as-a-judge (LaaJ) adoption has accelerated rapidly, appearing in more papers than human evaluation in 2025. This pivotal shift motivates a critical analysis of current evaluation practices. Overcoming the limits of rigid keyword filtering and manual review, we employ a multi-LLM information extraction pipeline to gather structured metadata from 14,171 papers across four major NLP conferences (2020-2025). Analyzing 3,334 filtered NLG papers, we identify three systemic challenges. (1) Metric inertia: despite the shift toward open-ended generation, legacy lexical metrics (BLEU, ROUGE) persist as primary indicators, typically used alongside rather than replaced by semantic alternatives. (2) Metric-criteria mapping problem: our paper-level co-occurrence data reveals that general-purpose automatic metrics are applied as broad proxies for quality, without specifying which dimension of text generation they are intended to evaluate. (3) Validation gap: LaaJ has grown rapidly without commensurate human validation (fewer than 8% of papers). Crucially, while LaaJ correlates with aggregate quality, alignment collapses on fine-grained criteria like fluency. To address these gaps, we distill our findings into a minimal Evaluation Checklist to guide metric selection, construct validity, and LaaJ deployment.

2601.06329 2026-05-28 cs.CL cs.AI

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

论口语语言模型评估中全局令牌困惑度的谬误

Chan-Jan Hsu, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso

AI总结 针对口语语言模型评估中直接使用文本困惑度公式计算语音令牌困惑度的问题,提出基于似然和生成的新型评估方法,更忠实反映生成质量,并缩小了最佳模型与人类基线之间的差距。

AI中文摘要

在大规模原始音频上预训练的生成式口语语言模型能够以适当内容继续语音提示,同时保留说话人和情感等属性,作为口语对话的基础模型。在先前文献中,这些模型通常使用“全局令牌困惑度”进行评估,该指标直接将文本困惑度公式应用于语音令牌。然而,这种做法忽略了语音和文本模态之间的根本差异,可能导致对语音特性的低估。在这项工作中,我们提出了多种基于似然和生成的评估方法,以替代朴素的全局令牌困惑度。我们证明,所提出的评估更忠实地反映了感知生成质量,与人类评分的平均意见得分(MOS)具有更强的相关性。在新指标下评估时,口语语言模型的相对性能格局被重塑,揭示了最佳性能模型与人类基线之间的差距显著缩小。总之,这些结果表明,适当的评估对于准确评估口语语言建模的进展至关重要。

英文摘要

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood-based and generation-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
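
For reference, the text perplexity formulation that the abstract says is carried over to speech tokens is the exponential of the mean negative log-likelihood. The sketch below assumes that "global" means pooling over all tokens in the evaluated sequences, which the abstract does not spell out; the function name is hypothetical.

```python
import math


def token_perplexity(log_probs):
    """exp of the mean negative log-likelihood over all scored tokens.

    log_probs holds the natural-log probability the model assigned to each
    observed token given its preceding context.
    """
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that spreads probability evenly over 4 tokens at every step.
uniform_log_probs = [math.log(0.25)] * 8
ppl = token_perplexity(uniform_log_probs)
```

Every token contributes equally to this average, so it has no notion of which speech tokens carry content versus acoustic detail.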

2601.05386 2026-05-28 cs.AI

How Much Can a Few Engine Moves Help? Quantifying Limited Cheating in Chess

几次引擎走棋能有多大帮助?量化国际象棋中的有限作弊

Daniel Keren

AI总结 本文通过阈值策略和Bellman策略,在Stockfish引擎对弈中量化有限次作弊对棋手得分的影响,并引入无引擎模拟器优化超参数。

Comments Accepted, IEEE CoG 2026 (IEEE Conference on Games 2026). Replaces previous version "On the Effect of Cheating in Chess"

AI中文摘要

国际象棋中利用强大软件建议作弊已成为一个主要问题,甚至达到最高水平。与以往大多数关注作弊检测的工作不同,本文尝试评估在比赛中有限次数作弊可能带来的性能提升。我们开发了基于阈值和Bellman风格的干预策略,并在使用Stockfish的受控引擎对引擎设置中进行测试。明智地选择1次或2次作弊分别得到平均得分0.71和0.82,而无作弊得分为0.51。我们还引入了一个快速、无引擎的模拟器,无需运行对局即可进行超参数优化,与基于引擎的最优值紧密匹配。本工作的目的不是帮助作弊者,而是衡量作弊的有效性——这对于遏制和检测作弊的努力至关重要。

英文摘要

Cheating in chess, by using advice from powerful software, has become a major problem, reaching the highest levels. As opposed to the large majority of previous work, which concerned detection of cheating, here we try to evaluate the possible gain in performance obtained by cheating a limited number of times during a game. We develop threshold-based and Bellman-style intervention policies, and test them in a controlled engine-vs-engine setting using Stockfish. A judicious choice of 1 or 2 cheats yields average scores of 0.71 and 0.82, respectively, compared to 0.51 with no cheats. We also introduce a fast, engine-free simulator that enables hyperparameter optimization without running games, closely matching the engine-based optimum. The goal of this work is not to assist cheaters, but to measure the effectiveness of cheating -- which is crucial as part of the effort to contain and detect it.
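
A threshold-based intervention policy of the kind named above can be sketched as follows. The abstract does not define the paper's policy, so the decision signal (a per-move evaluation gain from switching to the engine move), the threshold value, and the function name are all hypothetical.

```python
def threshold_policy(gains, threshold, budget):
    """Return the indices of moves where the engine suggestion is used.

    gains[i] is an assumed estimate of how much the evaluation improves if
    the engine move replaces the player's intended move at move i.
    """
    used = []
    for index, gain in enumerate(gains):
        # Intervene only while budget remains and the gain is large enough.
        if len(used) < budget and gain >= threshold:
            used.append(index)
    return used


gains = [0.1, 0.9, 0.2, 1.5, 2.0]
chosen = threshold_policy(gains, threshold=0.8, budget=2)
```

In this toy run the budget is exhausted at moves 1 and 3, so the largest gain at move 4 is missed. A fixed threshold cannot weigh a current opportunity against future ones, which is presumably the motivation for also studying Bellman-style policies.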

2601.04876 2026-05-28 cs.SD

ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models

ChronosAudio: 用于评估音频大语言模型的综合长音频基准

Kaiwen Luo, Liang Lin, Yibo Zhang, Moayad Aloqaily, Jialiang Tao, Dexian Wang, Zhenhong Zhou, Junwei Zhang, Kun Wang, Li Sun, Qingsong Wen

AI总结 提出首个针对音频大语言模型长音频理解的多任务基准ChronosAudio,包含6大任务类别和36000个测试实例,实验发现模型存在长上下文崩溃、注意力稀释等问题,现有缓解策略仅恢复50%性能。

AI中文摘要

尽管音频大语言模型(ALLMs)取得了显著进展,但其长音频理解能力仍未得到探索。针对通用音频任务,已有大量基准被提出,但它们主要关注短音频片段,缺乏评估ALLMs在长时间跨度上的共识。本文提出ChronosAudio,这是首个针对ALLMs长音频理解定制的多任务基准。它包含六大任务类别,共36000个测试实例,总时长超过200小时,并按短、中、长三类进行分层,以全面评估长度泛化能力。使用ChronosAudio对16个最先进模型进行的广泛实验得出了三个关键发现:1. 急剧的长上下文崩溃:ALLMs表现出严重的性能维持能力不足,从短上下文过渡到长上下文时,特定任务的性能下降超过90%。2. 结构性注意力稀释:性能下降源于维持时间局部性的根本失败;注意力机制在后续序列中遭受显著扩散。3. 缓解措施的效果上限:当前策略仅能恢复50%的性能。这些发现揭示了长音频中的重大挑战,强调了迫切需要实现稳健的文档级音频推理的方法。

英文摘要

Although Audio Large Language Models (ALLMs) have witnessed substantial advancements, their long audio understanding capabilities remain unexplored. While a plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving no consensus on how to evaluate ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored for long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours of audio, stratified into short-, middle-, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: 1. Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a staggering performance degradation of over 90% in specific tasks. 2. Structural Attention Dilution: Performance degradation stems from a fundamental failure in maintaining temporal locality; attention mechanisms suffer from significant diffusion in later sequences. 3. Restorative Ceiling of Mitigation: Current strategies only offer 50% recovery. These findings reveal significant challenges in long-audio understanding, underscoring the urgent need for approaches to achieve robust, document-level audio reasoning.

2601.03549 2026-05-28 cs.CV cs.CL

FEA-SLT: A Gloss-Free End-to-End Framework for Facial-Expression-Aware Sign Language Translation

FEA-SLT:一种面向面部表情感知的手语翻译的无词汇端到端框架

Guobin Tu, Di Weng

AI总结 提出FEA-SLT框架,通过面部表情感知融合模块利用面部动态作为语义锚点,解决无词汇手语翻译中手势歧义问题,在PHOENIX14T和CSL-Daily数据集上达到最优BLEU性能。

AI中文摘要

手语翻译(SLT)是一项具有挑战性的跨模态任务,需要对手部动作和非手动信号进行联合建模。现有的无词汇SLT方法有效捕捉手势动态,但常常未充分利用面部表情,而面部表情在语法和消除歧义中起着关键作用。当不同概念共享相似手部配置时,这一限制可能导致语义退化。为解决此问题,我们提出FEA-SLT(面部表情感知手语翻译),一种无词汇端到端框架,利用面部动态作为语义锚点来消除手部歧义。FEA-SLT采用领域迁移的面部编码器提取表情敏感表示,并通过语言约束的面部表情感知融合(FEAF)模块将其与手部特征集成。FEAF通过双向调制捕捉手部和面部通道之间的相互依赖关系,增强句法保真度。在PHOENIX14T和CSL-Daily上的实验表明,FEA-SLT在无词汇方法中实现了最先进的BLEU性能,而针对性分析证实了其对面部敏感语句翻译的改进。代码可在 https://github.com/TuGuobin/FEA-SLT 获取。

英文摘要

Sign Language Translation (SLT) is a challenging cross-modal task requiring joint modeling of manual articulations and non-manual signals. Existing gloss-free SLT methods effectively capture gestural dynamics but often underutilize facial expressions, which play crucial grammatical and disambiguating roles. This limitation can cause semantic degradation when distinct concepts share similar manual configurations. To address this issue, we propose FEA-SLT (Facial-Expression-Aware Sign Language Translation), a gloss-free end-to-end framework that uses facial dynamics as semantic anchors for resolving manual ambiguity. FEA-SLT employs a domain-transferred facial encoder to extract expression-sensitive representations and integrates them with manual features through a linguistically constrained Facial-Expression-Aware Fusion (FEAF) module. FEAF captures reciprocal dependencies between manual and facial channels via bidirectional modulation, enhancing syntactic fidelity. Experiments on PHOENIX14T and CSL-Daily show that FEA-SLT achieves state-of-the-art BLEU performance among gloss-free methods, while targeted analyses confirm improved translation of facial-sensitive utterances. Code is available at https://github.com/TuGuobin/FEA-SLT.