arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2602.01807 2026-05-28 cs.CL cs.LG

Sentence Curve Language Models

DongNyeong Heo, Taehwan Kim, Heeyoul Choi

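Affiliations: Ulsan National Institute of Science and Technology; Handong Global University

AI summary: Proposes the sentence curve representation, extending diffusion language models to predict sentence curves rather than static word embeddings in order to strengthen global structure modeling, and reports state-of-the-art results among DLMs on IWSLT14 and WMT14.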

Abstract

Language models (LMs) are a central component of modern AI systems, and diffusion language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while global sentence structure is less emphasized. To address this, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves state-of-the-art performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.
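
To illustrate how a single control point can influence several word positions, here is a minimal sketch that uses a Bézier curve as one simple kind of spline. It is not the authors' implementation: the names `bezier_point` and `sentence_curve`, the choice of a Bézier basis, and the even spacing of word positions are all assumptions made for this example.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve (one simple spline type) at t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, cp in enumerate(control_points):
        # Bernstein weight: how strongly control point i influences position t.
        w = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        for d in range(dim):
            point[d] += w * cp[d]
    return point


def sentence_curve(control_points, num_words):
    """Sample the curve at evenly spaced positions, one per word slot (assumed layout)."""
    if num_words == 1:
        return [bezier_point(control_points, 0.0)]
    return [bezier_point(control_points, k / (num_words - 1)) for k in range(num_words)]
```

Moving the middle control point changes every interior sample of the curve, which is the sense in which a control point "affects multiple words", unlike a static per-word embedding.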

2602.02417 2026-05-28 cs.LG

Trust Region Continual Learning as an Implicit Meta-Learner

Zekun Wang, Anant Gupta, Christopher J. MacLellan

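Affiliations: Georgia Institute of Technology

AI summary: Proposes trust region continual learning, which combines generative replay with a Fisher-metric trust region constraint to obtain an implicit meta-learning effect, and achieves the best performance on task-incremental diffusion image generation and continual diffusion-policy control.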

Comments 21 pages, 21 tables

Abstract

Continual learning aims to acquire tasks sequentially without catastrophic forgetting, yet standard strategies face a core tradeoff: regularization-based methods (e.g., EWC) can overconstrain updates when task optima are weakly overlapping, while replay-based methods can retain performance but drift due to imperfect replay. We study a hybrid perspective: trust region continual learning that combines generative replay with a Fisher-metric trust region constraint. We show that, under local approximations, the resulting update admits a MAML-style interpretation with a single implicit inner step: replay supplies an old-task gradient signal (query-like), while the Fisher-weighted penalty provides an efficient offline curvature shaping (support-like). This yields an emergent meta-learning property in continual learning: the model becomes an initialization that rapidly re-converges to prior task optima after each task transition, without explicitly optimizing a bilevel objective. Empirically, on task-incremental diffusion image generation and continual diffusion-policy control, trust region continual learning achieves the best final performance and retention, and consistently recovers early-task performance faster than EWC, replay, and continual meta-learning baselines.
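
The combination of a replay term with a Fisher-weighted penalty can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a diagonal Fisher estimate, replaces the trust region constraint with a soft quadratic penalty (as in EWC), and uses the hypothetical names `fisher_penalty` and `hybrid_objective`.

```python
def fisher_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """EWC-style quadratic penalty, weighted by a diagonal Fisher estimate (assumption)."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_old, fisher_diag)
    )


def hybrid_objective(new_task_loss, replay_loss, theta, theta_old, fisher_diag, lam=1.0):
    """New-task loss + loss on replayed old-task samples + Fisher-weighted penalty."""
    return new_task_loss + replay_loss + fisher_penalty(theta, theta_old, fisher_diag, lam)
```

Parameters with large Fisher values are held close to their old values, while the replay term supplies a direct gradient signal from earlier tasks.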

2602.02259 2026-05-28 cs.LG cs.CV

Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

Marcus Fechner, Hamza Adnan, Constantin C. Lüth, Matthew T. Jackson, Alexey Zakharov, J. Marius Zöllner

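Affiliations: Karlsruhe Institute of Technology; University of Oxford

AI summary: To address the failure of latent action models under action-correlated visual distractors, proposes MaskLAM, which obtains agent masks zero-shot from segmentation foundation models (e.g., SAM) and restricts the reconstruction objective to agent pixels, forcing latent actions to encode endogenous dynamics and substantially improving downstream policy performance.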

Abstract

Latent action models (LAMs) offer a promising path to pre-training embodied agents on large amounts of action-free video. They infer latent actions between consecutive observations that can later be decoded to ground-truth actions using a small number of labels. However, recent work has shown that this recipe fails in the presence of action-correlated visual distractors common in real-world video, such as dynamic backgrounds, camera shake, or other moving objects. In these scenarios, the standard reconstruction objective drives latent actions to encode exogenous motion instead of agent-controlled dynamics, resulting in policies that underperform when fine-tuned. We observe, however, that endogenous and exogenous factors are typically spatially separated in pixel space: control-relevant change is concentrated on the agent, while distractor motion occurs elsewhere. We exploit this observation by restricting the reconstruction objective to agent pixels, forcing latent actions to explain agent-controlled dynamics rather than exogenous ones. We call this method MaskLAM; it obtains the agent mask zero-shot from off-the-shelf segmentation foundation models (e.g., SAM) and requires no architectural changes, auxiliary losses, or action labels during pre-training. Across two continuous-control benchmarks (Distracting Control Suite, Distracting Meta-World), MaskLAM reduces normalized linear-probe MSE by up to $3.51\times$ and improves normalized return by up to $4.97\times$ over LAPO, while narrowing the gap to LAOM-Labels, which relies on ground-truth action supervision.
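
Restricting a reconstruction loss to agent pixels can be sketched as a masked mean squared error. The function below is a hypothetical illustration on flat pixel lists, assuming a binary mask is already available; it does not call SAM or reproduce the MaskLAM training code.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over pixels where mask == 1 (agent pixels); others are ignored."""
    squared_error = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    num_agent_pixels = sum(mask)
    return squared_error / num_agent_pixels if num_agent_pixels else 0.0
```

In this toy form, a large error on a background pixel (mask 0) contributes nothing to the loss, so a model trained on it has no incentive to explain distractor motion.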

2602.02150 2026-05-28 cs.LG cs.AI

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo

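Affiliations: Northeastern University, Shenyang, China; Shenzhen Campus of Sun Yat-sen University, China

AI summary: To address rollout collapse caused by high-entropy branching and overfitting induced by noisy early pseudo-labels in test-time reinforcement learning, proposes Entropy-Confidence Hybrid Group Relative Policy Optimization (ECHO), which mitigates collapse through adaptive branching control and confidence-based pruning, and improves training robustness through confidence-adaptive clipping and advantage shaping.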

Comments 19 pages

Abstract

Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree-structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high-entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy-Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group-level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low-confidence branches, avoiding high-entropy traps and mitigating collapse. During policy updates, ECHO employs confidence-adaptive clipping and an entropy-confidence hybrid advantage shaping approach to enhance training robustness and mitigate early-stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.
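
Two of the basic quantities mentioned above, a majority-vote pseudo-label with its vote share and the entropy of a next-token distribution, can be computed as in the sketch below. The function names are hypothetical, and using the vote share as a "confidence" value is an assumption for illustration; ECHO's actual branching, pruning, and clipping rules are not reproduced here.

```python
from collections import Counter
from math import log


def majority_pseudo_label(answers):
    """Return the most frequent answer and its vote share (used here as a confidence proxy)."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * log(p) for p in probs if p > 0)
```

A low vote share signals a noisy pseudo-label, and a high entropy marks a position where branching explores more alternatives.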

2602.01990 2026-05-28 cs.LG cs.AI

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

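Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China

AI summary: To address router drift and expert drift in multimodal continual instruction tuning, proposes StAbilized Mixture-of-Experts (SAME), which decomposes routing dynamics into orthogonal subspaces and updates experts with curvature-aware scaling, achieving state-of-the-art performance without rehearsal.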

Comments Accepted to ICML 2026. Code is available at https://github.com/LAMDA-CL/Prism

Abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. We also introduce a new benchmark to evaluate MCIT with long task sequences, and extensive experiments demonstrate SAME's SOTA performance. Code is available at https://github.com/LAMDA-CL/Prism.
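
The idea of updating only along some directions while leaving an orthogonal subspace untouched can be sketched with a generic orthogonal projection. This is a textbook projection and not SAME's decomposition: the function name `project_out` is hypothetical, and the protected directions are assumed to be given as an orthonormal basis.

```python
def project_out(update, protected_basis):
    """Remove from `update` its components along each orthonormal vector in `protected_basis`."""
    result = list(update)
    for basis_vector in protected_basis:
        coeff = sum(r * b for r, b in zip(result, basis_vector))
        result = [r - coeff * b for r, b in zip(result, basis_vector)]
    return result
```

After projection, the update has zero inner product with every protected direction, so routing behaviour along those directions is left unchanged by the step.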

2602.01745 2026-05-28 cs.LG cs.AI

Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning

Wenhao Yu, Shaohang Wei, Jiahong Liu, Yifan Li, Minda Hu, Aiwei Liu, Hao Zhang, Irwin King

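Affiliations: The Chinese University of Hong Kong; Peking University; Tsinghua University; University of the Chinese Academy of Sciences

AI summary: Proposes a probability-entropy calibration signal (the Relative Rank Indicator) for token-level reweighting that balances the pre-training prior against downstream alignment, outperforming probability-only and entropy-only methods on mathematical reasoning, out-of-distribution reasoning, and code generation tasks.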

Comments Accepted by ICML 2026

Abstract

Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-dimensional: the ground-truth probability reflects downstream alignment, while token entropy reflects intrinsic uncertainty induced by the pre-training prior. Ignoring entropy can misidentify noisy or easily replaceable tokens as learning-critical, while ignoring probability fails to reflect target-specific alignment. RankTuner introduces a probability-entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and improved code generation performance over probability-only or entropy-only reweighting baselines.
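
The two quantities that the indicator is said to compare can be computed as in the sketch below. The abstract does not give the formula that combines them, so the combination is deliberately left out. The function names, the 1-based ranking, and the tie-breaking by token index are assumptions.

```python
def token_ranks(probs):
    """1-based rank of each token by descending probability (ties broken by index)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    ranks = [0] * len(probs)
    for rank, token_index in enumerate(order, start=1):
        ranks[token_index] = rank
    return ranks


def rank_and_expected_rank(probs, target_index):
    """Rank of the ground-truth token, and the expected rank under the distribution itself."""
    ranks = token_ranks(probs)
    expected_rank = sum(p * r for p, r in zip(probs, ranks))
    return ranks[target_index], expected_rank
```

A flat (high-entropy) distribution has a large expected rank, so a ground-truth token ranked 2 or 3 there is less surprising than the same rank under a sharply peaked distribution.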

2510.02174 2026-05-28 cs.LG math.OC math.PR stat.ML

Flatness-Aware Stochastic Gradient Langevin Dynamics

Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim

Affiliations: UNIST InnoCORE AI-Space Solar Initiative, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea; Artificial Intelligence Graduate School, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea; Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea; School of Mathematics, University of Edinburgh, Edinburgh, United Kingdom; Department of Mathematics, National Technical University of Athens, Athens, Greece; Archimedes, Athena Research and Innovation Centre, Marousi, Greece

AI summary: Proposes Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), which couples the noise scale with the inverse temperature as theoretically prescribed to bias the dynamics toward flat basins while retaining computational efficiency, and provides a non-asymptotic theoretical analysis and experimental validation.

Comments Accepted by ICML 2026

Journal ref ICML 2026

Abstract

Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), a first-order optimization method that biases its learning dynamics toward flat basins while retaining the computational and memory efficiency of SGD and SGLD. We provide a non-asymptotic theoretical analysis showing that fSGLD targets a flatness-biased Gibbs distribution under a theoretically prescribed coupling between the noise scale $σ$ and the inverse temperature $β$, together with explicit excess risk guarantees. We empirically evaluate fSGLD across standard optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection, demonstrating consistently strong performance and reliable uncertainty estimates. Additional experiments confirm the effectiveness of the theoretically prescribed $β$-$σ$ coupling compared to decoupled choices.
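
For context, the plain SGLD update that fSGLD is compared against can be sketched as below. This is only the standard baseline step with inverse temperature $β$; the flatness-aware modification and the prescribed $β$-$σ$ coupling are not reproduced, and the function name and argument layout are assumptions.

```python
import random
from math import sqrt


def sgld_step(theta, grad, step_size, beta, rng):
    """One plain SGLD step: gradient descent plus Gaussian noise of scale sqrt(2 * step / beta)."""
    noise_scale = sqrt(2.0 * step_size / beta)
    return [
        t - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]
```

As `beta` grows, the injected noise shrinks and the step approaches ordinary (stochastic) gradient descent; for example, `sgld_step([1.0], [2.0], 0.1, float("inf"), random.Random(0))` reduces to a noiseless gradient step.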

2509.23074 2026-05-28 cs.LG cs.AI

Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

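Affiliations: Department of Electronic Engineering, Tsinghua University, Beijing, China

AI summary: To address benchmark-leaderboard evaluation that conflates model performance with the data's intrinsic unpredictability, proposes a predictability-aligned diagnostic framework grounded in spectral coherence, comprising the SCP score and the LUR tool, and reveals predictability drift and an architectural trade-off between model types.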

Abstract

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

2602.01203 2026-05-28 cs.CL cs.LG

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

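Affiliations: Institute for Artificial Intelligence, Peking University, Beijing; School of Integrated Circuits, Peking University, Beijing

AI summary: Shows theoretically and empirically that the attention sink naturally constructs a Mixture-of-Experts mechanism within attention layers, and proposes a sink-aware training algorithm that mitigates head collapse and improves model performance.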

Comments 2026 International Conference on Machine Learning (ICML)

Abstract

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
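
One simple way to quantify how unevenly load is spread across heads is the squared coefficient of variation, a form used for balancing losses in earlier sparse MoE work. The sketch below is an assumption-laden illustration: the paper's auxiliary loss is not specified in the abstract, and both the function name and the notion of a per-head "load" value are hypothetical.

```python
def load_balance_penalty(head_loads):
    """Squared coefficient of variation of per-head loads: 0 when all heads carry equal load."""
    n = len(head_loads)
    mean = sum(head_loads) / n
    if mean == 0:
        return 0.0
    variance = sum((x - mean) ** 2 for x in head_loads) / n
    return variance / (mean * mean)
```

With equal loads the penalty is zero; when a single head out of four carries all the load, as in a collapsed layer, it rises to 3.0.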

2512.14340 2026-05-28 cs.RO

Field evaluation and optimization of a lightweight autonomous lidar-based UAV system based on a rigorous experimental setup in boreal forest environments

Aleksi Karhunen, Teemu Hakala, Väinö Karjalainen, Eija Honkavaara

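Affiliations: Finnish Geospatial Research Institute in National Land Survey of Finland

AI summary: Proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems, demonstrated with a lightweight lidar-based quadrotor over 93 real-world flights in boreal forests; the optimized system achieved success rates of 12/15 and 15/15 at 1 m/s and 2 m/s in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest.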

Comments This work has been submitted to the IEEE for possible publication

Abstract

Interest in utilizing autonomous uncrewed aerial vehicles (UAVs) for under-canopy forest remote sensing has increased in recent years, resulting in the publication of numerous autonomous flight algorithms in the scientific literature. To support the selection and development of such algorithms, a reliable comparison of existing approaches based on published studies is essential. However, reliable comparisons are currently challenging due to widely varying experimental setups and incomplete reporting practices. This study proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems to fill this gap. The proposed setup emphasizes quantitative reporting of forest complexity, visual representation of test environments, execution of multiple repeated flights, and reporting of flight success rates alongside qualitative flight results. In addition, flights at multiple target speeds are encouraged, with reporting of realized flight speed, mission completion time, and point-to-point flight distance. The proposed setup is demonstrated using a lightweight lidar-based quadrotor employing state-of-the-art open-source algorithms, evaluated through extensive experiments in two natural boreal forest environments. Based on a systematic evaluation of the original system, several improvements were introduced. The same experimental protocol was then repeated with the optimized system, resulting in a total of 93 real-world flights. The optimized system achieved success rates of 12/15 and 15/15 at target flight speeds of 1 m/s and 2 m/s, respectively, in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest. Adoption of the proposed experimental setup would facilitate the literature-based comparison of autonomous under-canopy flight systems and support systematic performance improvement of future UAV-based forest robotics solutions.
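
The reported success counts convert to percentages as in the short sketch below. The counts are copied from the abstract's figures for the optimized system; the dictionary layout and function names are assumptions for illustration.

```python
# (forest difficulty, target speed in m/s) -> (successful flights, attempted flights),
# taken from the optimized-system results quoted in the abstract.
OPTIMIZED_RESULTS = {
    ("medium", 1): (12, 15),
    ("medium", 2): (15, 15),
    ("difficult", 1): (12, 15),
    ("difficult", 2): (5, 15),
}


def success_rate(successes, attempts):
    """Fraction of attempted flights that succeeded."""
    return successes / attempts


def summarize(results):
    """Success rate in percent, rounded to one decimal, for each test condition."""
    return {key: round(100.0 * success_rate(s, n), 1) for key, (s, n) in results.items()}
```

Read this way, the drop from 12/15 to 5/15 in the difficult forest when the target speed doubles is a fall from 80% to roughly 33%.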

2601.23262 2026-05-28 cs.LG

Particle-Guided Diffusion Models for Partial Differential Equations

Andrew Millard, Fredrik Lindsten, Zheng Zhao

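Affiliations: Department of Computer and Information Science, Linköping University, Linköping, Sweden

AI summary: Proposes a particle-guided stochastic sampling method that combines diffusion models with physics-based guidance from PDE residuals and observational constraints, embedded in a Sequential Monte Carlo framework to yield a scalable generative PDE solver with lower numerical error than existing methods on multiple benchmarks and multiphysics systems.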

Abstract

We introduce a guided stochastic sampling method that augments sampling from diffusion models with physics-based guidance derived from partial differential equation (PDE) residuals and observational constraints, ensuring generated samples remain physically admissible. We embed this sampling procedure within a new Sequential Monte Carlo (SMC) framework, yielding a scalable generative PDE solver. Across multiple benchmark PDE systems as well as multiphysics and interacting PDE systems, our method produces solution fields with lower numerical error than existing state-of-the-art generative methods.
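
A generic Sequential Monte Carlo reweight-and-resample step can be sketched as follows. This is the textbook mechanism, not the paper's framework: weighting particles by `exp(-residual / scale)` is an assumption chosen to show how a PDE residual could favour physically admissible samples, and all names are hypothetical.

```python
import random
from math import exp


def normalized_weights(residuals, scale=1.0):
    """Turn per-particle residuals into normalized weights; a lower residual gets a higher weight."""
    smallest = min(residuals)  # subtracting the minimum keeps exp() numerically stable
    raw = [exp(-(r - smallest) / scale) for r in residuals]
    total = sum(raw)
    return [w / total for w in raw]


def resample(particles, weights, rng):
    """Multinomial resampling: draw len(particles) particles with probability given by weights."""
    return rng.choices(particles, weights=weights, k=len(particles))
```

After resampling, particles with large residuals tend to be dropped and those with small residuals duplicated, which concentrates the population on low-residual candidates.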

2510.08525 2026-05-28 cs.CL

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

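Affiliations: Westlake University; McGill University; Mila - Quebec AI Institute; Zhejiang University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes RLKV, which uses reinforcement learning to identify the attention heads critical to reasoning quality, keeps the full KV cache for those heads, and aggressively compresses the others, achieving a 20-60% cache reduction with near-lossless performance.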

Abstract

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20-60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
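
The cache saving from keeping full KV cache on some heads and a constant-size cache on the rest is simple arithmetic, sketched below. The function name and every number in the usage example are illustrative assumptions, not values from the paper.

```python
def cache_reduction(num_heads, full_heads, seq_len, const_size):
    """Fraction of KV cache entries saved when only `full_heads` heads keep the whole sequence."""
    compressed_heads = num_heads - full_heads
    # Compressed heads keep at most `const_size` entries, and never more than the sequence length.
    kept = full_heads * seq_len + compressed_heads * min(const_size, seq_len)
    return 1.0 - kept / (num_heads * seq_len)
```

For example, with 10 heads of which 4 keep the full cache, a sequence of 1000 tokens, and a constant cache of 100 entries for the other 6 heads, `cache_reduction(10, 4, 1000, 100)` gives a reduction of about 0.54.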

2507.16679 2026-05-28 cs.CL cs.AI cs.CY

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

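Affiliations: Johns Hopkins University, Baltimore, MD, USA; North Carolina State University, Raleigh, NC, USA; Microsoft Research Asia, Beijing, China; Tongji University, Shanghai, China

AI summary: To address the Instruction Bottleneck caused by value conflicts in in-context alignment, proposes PICACO, which optimizes a meta-instruction by maximizing the total correlation between the specified values and model responses, achieving balanced pluralistic value alignment without fine-tuning.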

Comments ICML 2026

Abstract

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions: human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
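
Total correlation for discrete variables is the sum of the marginal entropies minus the joint entropy. The sketch below computes it from an explicit joint probability table; it illustrates the quantity only and is not PICACO's estimator, which must work with LLM responses rather than a known table. The function names and the dictionary representation are assumptions.

```python
from collections import defaultdict
from math import log


def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * log(p) for p in dist.values() if p > 0)


def total_correlation(joint):
    """joint: dict mapping outcome tuples (x1, ..., xn) to probabilities summing to 1."""
    num_vars = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(num_vars)]
    for outcome, p in joint.items():
        for i, value in enumerate(outcome):
            marginals[i][value] += p
    return sum(entropy(m) for m in marginals) - entropy(joint)
```

Independent variables give a total correlation of zero, while two perfectly coupled binary variables give log 2, so a larger value indicates that the variables carry more shared information.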

2601.21666 2026-05-28 cs.AI cs.CV

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

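Affiliations: Vector Institute for Artificial Intelligence; University of Groningen; York University

AI summary: Introduces the SONIC-O1 benchmark, comprising 60 hours of human-verified audio-video data, to evaluate multimodal large language models on open-ended summarization, multiple-choice question answering, and temporal localization; finds significant performance gaps and demographic disparities in temporal localization.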

Abstract

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2601.21167 2026-05-28 cs.LG

Learning What to Recommend: Minimax Optimal Simple Regret in Logistic Bandits

Shuai Liu, Alireza Bakhtiari, Alex Ayoub, Botao Hao, Csaba Szepesvári

发表机构 * University of Alberta(阿尔伯塔大学) University of Washington, Seattle(华盛顿大学(西雅图)) OpenAI

AI总结 针对简单遗憾目标下的随机逻辑斯蒂老虎机,提出两种曲率感知算法(MULog和THATS),实现与下界匹配的遗憾上界,并揭示最优动作处sigmoid逆斜率κ_*决定极小极大难度。

详情
AI中文摘要

我们研究在简单遗憾目标下具有$d$维动作特征的随机逻辑斯蒂老虎机,其中学习者使用$T$轮探索输出单个最终动作。逻辑斯蒂结构在此至关重要:因为动作的信息量取决于sigmoid的局部曲率,对即时奖励最优的动作不一定对识别最佳最终推荐最有用。我们表明一阶极小极大难度由$κ_*$(sigmoid在最优动作处的逆斜率)主导。下界由一个移位饱和困难族实现,其中饱和同时限制了关于最终决策的可用信息,并控制了错误推荐的价值损失。这揭示了一种与累积遗憾构造不同的困难机制,尽管在线到批处理归约在期望上恢复了相同的领先阶。然后我们开发了两种曲率感知算法:MULog,一种纯探索方法,其最终推荐满足阶为$\tilde O(d/\sqrt{κ_* T})$的高概率上界,与下界匹配至对数因子;以及THATS,一种汤普森采样风格的方法,提供了计算上更轻的替代方案。在困难和简单几何上的实验支持相同的图景:信息性低奖励动作可以使实例显著更容易,而曲率感知方法特别有效地利用了这种结构。

英文摘要

We study stochastic logistic bandits with $d$-dimensional action features under the simple-regret objective, where a learner uses $T$ rounds of exploration to output a single final action. The logistic structure is essential here: because the informativeness of an action depends on the local curvature of the sigmoid, actions that are best for immediate reward need not be the most useful for identifying the best final recommendation. We show that the first-order minimax difficulty is governed by $κ_*$, the inverse slope of the sigmoid at the optimal action. The lower bound is realized by a shifted saturated hard family in which saturation simultaneously limits the information available about the final decision and controls the value loss from a wrong recommendation. This reveals a hard mechanism distinct from cumulative-regret constructions, even though online-to-batch reductions recover the same leading order in expectation. We then develop two curvature-aware algorithms: MULog, a pure-exploration method whose final recommendation satisfies a high-probability upper bound of order $\tilde O(d/\sqrt{κ_* T})$, matching the lower bound up to logarithmic factors, and THATS, a Thompson-sampling-style method that provides a computationally lighter alternative. Experiments on both hard and easy geometries support the same picture: informative low-reward actions can make instances substantially easier, and the curvature-aware methods exploit this structure especially effectively.
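As a rough illustration of the quantity $κ_*$ named in the abstract, the sketch below computes the inverse sigmoid slope at the best action of a toy instance and evaluates the leading-order rate $d/\sqrt{κ_* T}$ with constants and logarithmic factors dropped. The function names (`kappa_star`, `regret_rate`) and the toy parameter and action vectors are illustrative assumptions, not part of MULog or THATS.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic link function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z: float) -> float:
    """Derivative of the sigmoid at z: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def kappa_star(theta, actions):
    """Inverse sigmoid slope at the action with the highest mean reward.

    theta and actions are hypothetical toy inputs (plain lists of floats).
    The sigmoid is increasing, so the best action maximizes the logit.
    """
    logits = [sum(t * x for t, x in zip(theta, a)) for a in actions]
    z_star = max(logits)
    return 1.0 / sigmoid_slope(z_star)


def regret_rate(d: int, kappa: float, horizon: int) -> float:
    """Leading-order rate d / sqrt(kappa * T), without constants or logs."""
    return d / math.sqrt(kappa * horizon)


if __name__ == "__main__":
    # A saturated optimum (large logit) gives a larger kappa_star
    # than an optimum that sits at the steepest point of the sigmoid.
    print(kappa_star([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
    print(kappa_star([3.0], [[1.0], [-1.0]]))
```

Under this rate a more saturated optimum (larger $κ_*$) corresponds to a smaller simple-regret bound for fixed $d$ and $T$, which is the dependence the abstract states.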

2510.11234 2026-05-28 cs.LG

Neural Weight Compression for Language Models

语言模型的神经权重压缩

Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

发表机构 * POSTECH(浦项工科大学) Samsung Electronics Co., Ltd(三星电子公司)

AI总结 提出神经权重压缩(NWC)框架,通过训练神经编解码器在预训练权重数据集上实现高效压缩,解决张量异质性和重建损失与下游性能不匹配问题,在4-6比特区间取得优异精度-压缩权衡。

详情
AI中文摘要

随着模型规模和部署的增长,语言模型权重的高效压缩变得越来越关键。然而,现有大多数方法依赖于手工设计的变换和启发式方法,反映出对权重作为数据模态的理解有限。为了超越这一范式,我们将权重压缩公式化为神经编解码器学习,并提出了神经权重压缩(NWC),一个在预训练权重数据集上训练神经编解码器的框架。NWC解决了权重压缩固有的挑战,包括张量异质性和重建损失与下游性能之间的不匹配。实验表明,NWC实现了极具竞争力的精度-压缩权衡,在4-6比特区间内尤其强劲,且不依赖刚性的手工设计组件(如Hadamard变换)。这些优势扩展到不同架构,例如视觉编码器。我们的分析强调了熵约束量化和学习变换在使压缩适应权重数据和下游任务中的作用。

英文摘要

Efficient compression of language model weights is increasingly critical as model scale and deployment grow. Yet, most existing methods rely on handcrafted transforms and heuristics, reflecting the limited understanding of weights as a data modality. To move beyond this paradigm, we formulate weight compression as neural codec learning and propose Neural Weight Compression (NWC), a framework for training neural codecs on pretrained weight datasets. NWC addresses challenges intrinsic to weight compression, including tensor heterogeneity and the mismatch between reconstruction losses and downstream performance. Experiments show that NWC achieves highly competitive accuracy-compression tradeoffs, with particularly strong results in the 4-6 bit regime, without relying on rigid handcrafted components such as the Hadamard transform. These gains extend across diverse architectures, e.g., vision encoders. Our analysis highlights the roles of entropy-constrained quantization and learned transforms in adapting compression to weight data and downstream tasks.

2601.19926 2026-05-28 cs.CL cs.AI

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Transformer的语法:语言模型中句法知识可解释性研究的系统综述

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

发表机构 * Universitat Pompeu Fabra(巴塞罗那庞培乌法布拉大学) ICREA(加泰罗尼亚高等研究院)

AI总结 通过对337篇文章的系统综述,评估基于Transformer的语言模型(TLM)的句法能力,发现TLM编码了非平凡的句法知识,但句法-语义接口现象表现较弱,且研究集中在英语和BERT类模型上。

详情
AI中文摘要

我们对337篇评估基于Transformer的语言模型(TLM)句法能力的文章进行了系统综述,报告了涵盖广泛句法现象、语言、模型和方法的3000多个数据点。这些数据共同表明,TLM编码了非平凡的句法知识。行为证据显示,TLM在形式句法现象上表现强劲,但在句法-语义接口现象上表现较弱且多变。对于数字支持较少的语言,表现也持续较低。探针和机制研究进一步支持TLM中存在句法知识。然而,由于大多数工作仍停留在观察层面,且当前方法在方法论上具有异质性,对句法处理背后的详细计算机制的洞察仍然有限。同时,文献仍然高度集中在英语和BERT类模型上。我们讨论了研究结果的意义,并为未来研究提供了建议。

英文摘要

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2601.08131 2026-05-28 cs.CL

Attention Projection Mixing with Exogenous Anchors

基于外生锚点的注意力投影混合

Jonathan Su

发表机构 * Independent Researcher(独立研究者)

AI总结 针对早期注意力投影跨层重用中内部锚点设计存在的结构冲突,提出ExoFormer模型,通过学习序列层外的外生锚点投影,并引入统一归一化混合框架,在减少令牌使用量的同时提升下游准确率。

详情
AI中文摘要

早期注意力投影的跨层重用可以改善优化和数据效率,但它造成了一个结构冲突:第一层必须同时作为所有更深层的稳定、可重用的锚点和有效的计算块。我们证明这种张力限制了内部锚点设计的性能。我们提出ExoFormer,通过在序列层堆栈之外学习外生锚点投影来解决这一冲突。我们引入了一个统一的归一化混合框架,该框架使用可学习的系数(探索系数粒度:元素级、头级和标量级)混合查询、键、值和门控对数,并表明归一化锚点源是稳定重用的关键。ExoFormer变体始终优于其内部锚点对应物,动态变体在匹配验证损失的情况下,使用比Gated Attention少1.5倍的令牌,获得1.5倍的下游准确率。我们通过卸载假说解释这种有效性:外部锚点保留必要的令牌身份,使层能够专门专注于特征变换。我们发布代码和模型以促进未来研究。

英文摘要

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Mask-GCG:对抗性后缀中的所有标记对于越狱攻击都是必要的吗?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

发表机构 * Politecnico di Milano(米兰理工学院) Beihang University(北京航空航天大学) East China Normal University(华东师范大学) Fudan University(复旦大学) University of the Chinese Academy of Sciences(中国科学院大学) AI Security Lab(360人工智能安全实验室)

AI总结 提出Mask-GCG方法,通过可学习的标记掩码识别后缀中高影响力标记并剪枝低影响力标记,降低计算开销并保持攻击成功率,揭示LLM提示中的标记冗余。

Comments Accepted to ICASSP 2026

Journal ref 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026

详情
AI中文摘要

针对大型语言模型(LLM)的越狱攻击已展示了多种成功方法,攻击者操纵模型生成其本应避免的有害响应。其中,贪婪坐标梯度(GCG)作为一种通用且有效的方法,通过优化后缀中的标记来生成可越狱的提示。尽管已提出多种GCG的改进变体,但它们都依赖于固定长度的后缀。然而,这些后缀中潜在的冗余尚未被探索。在这项工作中,我们提出Mask-GCG,一种即插即用的方法,采用可学习的标记掩码来识别后缀中的高影响力标记。我们的方法增加了高影响力位置标记的更新概率,同时剪枝低影响力位置的标记。这种剪枝不仅减少了冗余,还降低了梯度空间的大小,从而减少了计算开销,并缩短了实现成功攻击所需的时间。我们将Mask-GCG应用于原始GCG及其多种改进变体进行评估。实验结果表明,后缀中的大多数标记对攻击成功有显著贡献,剪枝少数低影响力标记不会影响损失值或攻击成功率(ASR),从而揭示了LLM提示中的标记冗余。我们的发现从越狱攻击的角度为开发高效且可解释的LLM提供了见解。

英文摘要

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.17737 2026-05-28 cs.CV cs.AI

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

脚本即一切:一个用于长程对话到电影视频生成的智能体框架

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

发表机构 * Tencent(腾讯)

AI总结 提出一个端到端智能体框架,通过训练ScripterAgent将对话转化为精细脚本,并利用DirectorAgent跨场景连续生成策略,实现长程对话到电影视频的连贯生成,显著提升脚本忠实度和时间保真度。

详情
AI中文摘要

近期视频生成的进展产生了能够从简单文本提示合成惊艳视觉内容的模型。然而,这些模型难以从对话等高层概念生成连贯的长篇叙事,揭示了创意想法与其电影执行之间的“语义鸿沟”。为弥合这一鸿沟,我们引入了一个新颖的、端到端的智能体框架,用于对话到电影视频的生成。我们框架的核心是ScripterAgent,一个经过训练将粗略对话转化为精细、可执行的电影脚本的模型。为此,我们构建了ScriptBench,一个具有丰富多模态上下文的新大规模基准,通过专家引导的流程进行标注。生成的脚本随后指导DirectorAgent,它使用跨场景连续生成策略协调最先进的视频模型,以确保长程连贯性。我们的全面评估,包括一个AI驱动的CriticAgent和一个新的视觉-脚本对齐(VSA)指标,表明我们的框架在所有测试的视频模型上显著提高了脚本忠实度和时间保真度。此外,我们的分析揭示了当前SOTA模型在视觉奇观与严格脚本遵循之间的关键权衡,为自动化电影制作的未来提供了宝贵见解。

英文摘要

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2601.18116 2026-05-28 cs.CL

BEAR: Budgeted Evidence Allocation for Multi-Document Reasoning

BEAR: 面向多文档推理的预算化证据分配

Lin Sun, Linglin Zhang, Jingang Huang, Change Jia, Zhengwei Cheng, Xiangzheng Zhang

发表机构 * Qiyuan Tech(启元科技)

AI总结 提出BEAR框架,通过构建分层语义索引并在查询时进行由粗到细的证据访问,在固定证据预算下实现高效的多文档推理。

详情
AI中文摘要

我们认为多文档推理不仅受限于模型能读取的文本量,还受限于有限的查询时证据预算如何在文档和语义粒度之间分配。全上下文推理非选择性地向模型提供广泛证据且每次查询成本高,而平面分块检索通常返回局部相关但跨文档综合组织薄弱的段落。我们提出BEAR,一个结构化证据分配框架,它离线构建分层语义索引,并在查询时通过互补的探索和恢复路径进行由粗到细的证据访问。这种由粗到细的设计可视为在固定证据上下文预算下的结构化证据分配。在合成和真实基准上,BEAR在DragonBall上表现尤为强劲,在HotpotQA上与强检索基线保持竞争力,并在我们的评估协议下在2Wiki上取得了最佳的基于检索的结果,同时其查询时证据预算远小于所报告的长上下文参考。进一步分析表明,性能提升与作为分配基础的分层结构以及互补的探索和恢复相关,而非仅靠语义分块。

英文摘要

We argue that multi-document reasoning is constrained not only by how much text a model can read, but also by how a limited query-time evidence budget is allocated across documents and semantic granularities. Full-context inference exposes the model to broad evidence non-selectively and at high per-query cost, while flat chunk retrieval often returns locally relevant passages that are weakly organized for cross-document synthesis. We present BEAR, a framework for structured evidence allocation that builds hierarchical semantic indices offline and performs coarse-to-fine evidence access at query time through complementary exploration and recovery paths. This coarse-to-fine design can be viewed as structured evidence allocation under a fixed evidence-context budget. Across synthetic and real-world benchmarks, BEAR performs particularly strongly on DragonBall, remains competitive with strong retrieval-based baselines on HotpotQA, and yields the best retrieval-based result on 2Wiki under our evaluated protocol, while operating under substantially smaller query-time evidence budgets than the reported long-context references. Additional analyses suggest that the gains are associated with hierarchy as an allocation substrate together with complementary exploration and recovery, rather than semantic chunking alone.

2404.06106 2026-05-28 cs.LG

Unifying Low Dimensional Spectra in Deep Learning

统一深度学习中的低维谱

Connall Garrod, Jonathan P. Keating

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所)

AI总结 本文利用无约束特征模型(UFM)证明深度神经坍缩(DNC)是多种深度学习矩阵(如Hessian、梯度和权重)中低维谱结构的统一来源,并给出了特征值和特征向量的解析构造。

Comments revised version; title changed slightly. 45 pages, 20 figures. Accepted at the International Conference on Machine Learning 2026

详情
AI中文摘要

在过参数化分类网络中,深度学习矩阵的特征谱中普遍出现低维结构。尽管理论进展旨在解释这一现象,但通常只能捕捉部分行为或依赖实践中不成立的假设。本文为几种典型的深度学习矩阵(包括Hessian、梯度和权重)的体加离群结构提供了解析解释。我们使用无约束特征模型(UFMs)——一种研究深度神经坍缩(DNC)出现的常用工具——来实现这一点。我们证明DNC是这些低维特征谱的根源,每种情况下,特征值和特征向量都可以从特征均值(DNC的表征对象)构造出来。这为深度学习中的广泛谱现象提供了统一的解析解释,并通过提供特征向量的详细分析,超越了通常仅关注特征值的经验刻画。我们证明结果对线性网络和ReLU网络均成立,并在建模语境和标准数据集上的标准深度网络架构中提供了数值验证。

英文摘要

Low dimensional structures appear ubiquitously in the eigenspectra of deep learning matrices in classification networks trained in the overparameterized regime. While theoretical advances have aimed to explain this phenomenology, they typically succeed only in capturing subsets of the full behavior or rely on assumptions that cannot hold in practice. In this work, we provide an analytic explanation for the bulk plus outlier structure of several canonical deep learning matrices, including the Hessian, gradients, and weights. We achieve this using unconstrained feature models (UFMs), a now-common tool for studying the emergence of deep neural collapse (DNC). We show that DNC is the source of these low dimensional eigenspectra: in each case, the eigenvalues and eigenvectors can be constructed from feature means, the characterizing objects of DNC. This provides a unifying analytic explanation for a wide range of spectral phenomena in deep learning and goes beyond empirical characterizations, which typically focus on eigenvalues, by providing a detailed analysis of eigenvectors. We prove that our results hold for both linear and ReLU networks and provide numerical validation in both the modeling context and standard deep-network architectures on canonical datasets.

2601.18006 2026-05-28 cs.CL

PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation

PEAR:机器翻译中自动相对评分的成对评估

Lorenzo Proietti, Roman Grundkiewicz, Matt Post

发表机构 * Sapienza University of Rome(罗马萨皮恩扎大学) Microsoft(微软)

AI总结 提出PEAR,一种监督式质量估计指标族,通过成对比较实现无参考机器翻译评估,预测质量差异方向和幅度,在WMT24基准上优于单候选基线,并有效用于最小贝叶斯风险解码。

Comments ACL 2026 Main Conference. 19 pages

详情
AI中文摘要

我们提出PEAR(成对评估用于自动相对评分),一种监督式质量估计(QE)指标族,将无参考机器翻译(MT)评估重新定义为分级成对比较。给定一个源片段和两个候选翻译,PEAR预测它们质量差异的方向和幅度。这些指标使用从人工判断差异中导出的成对监督进行训练,并添加一个正则化项,鼓励在候选顺序反转时符号反转。在WMT24元评估基准上,PEAR优于使用相同数据和骨干网络训练的严格匹配的单候选QE基线,隔离了所提出的成对公式的优势。尽管使用的参数远少于近期的大指标,PEAR超越了更大的QE模型和基于参考的指标。我们的分析进一步表明,与其他顶级指标相比,PEAR产生更少冗余的评估信号。最后,我们展示PEAR是用于最小贝叶斯风险(MBR)解码的有效效用函数,以可忽略的影响降低了成对评分成本。

英文摘要

We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised quality estimation (QE) metric family that reframes reference-free machine translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for minimum Bayes risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.
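The training signal described above (a pairwise target derived from the difference of human scores, plus a regularizer that encourages sign inversion when the candidate order is reversed) can be sketched as follows. This is a minimal sketch under stated assumptions: the squared-error form, the weight `lam`, and the function name `pairwise_loss` are hypothetical choices for illustration, since the abstract does not give the actual PEAR objective.

```python
def pairwise_loss(f_ab: float, f_ba: float,
                  human_a: float, human_b: float,
                  lam: float = 1.0) -> float:
    """Toy per-example loss for a graded pairwise metric.

    f_ab: model prediction for the ordered pair (candidate A, candidate B).
    f_ba: model prediction for the reversed pair (candidate B, candidate A).
    human_a, human_b: human quality scores of the two candidates.
    lam: weight of the antisymmetry regularizer (assumed hyperparameter).
    """
    # Supervision target: signed difference of the human judgments.
    target = human_a - human_b
    fit = (f_ab - target) ** 2
    # Sign-inversion regularizer: for an antisymmetric scorer,
    # f(A, B) = -f(B, A), so the sum of the two predictions is zero.
    antisymmetry = (f_ab + f_ba) ** 2
    return fit + lam * antisymmetry


if __name__ == "__main__":
    # Antisymmetric prediction that matches the target: no penalty.
    print(pairwise_loss(0.5, -0.5, human_a=1.0, human_b=0.5))
    # Order-insensitive prediction: only the regularizer fires.
    print(pairwise_loss(0.5, 0.5, human_a=1.0, human_b=0.5))
```

The regularizer is zero exactly when the two orderings produce predictions of equal magnitude and opposite sign, which is the behavior the abstract says the extra term encourages.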

2508.14082 2026-05-28 cs.LG

Toward Robust Semi-supervised Regression via Dual-stream Knowledge Distillation

通过双流知识蒸馏实现鲁棒半监督回归

Ye Su, Hezhe Qiao, Wei Huang, Lin Chen

发表机构 * Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences(重庆绿色智能技术研究所,中国科学院) Chongqing School, University of Chinese Academy of Sciences(中国科学院大学重庆学院) Singapore Management University(新加坡管理大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对半监督回归中未标记数据利用不足和伪标签噪声问题,提出双流知识蒸馏框架(DKD),通过蒸馏连续值知识和分布信息,并结合解耦分布对齐模块,提升回归预测的鲁棒性和样本效率。

Comments 12 pages

详情
AI中文摘要

半监督回归(SSR)旨在预测样本的连续分数,同时减少对大规模标记数据的依赖,近年来在计算机视觉、自然语言处理、音频分析和医学分析等各种应用中引起了广泛关注。现有的SSR方法通常通过引入基于约束的正则化或序数排序来使用稀缺的标记数据训练模型,以减轻过拟合。然而,这些方法往往未能充分利用丰富的未标记样本。尽管一致性驱动的伪标签方法试图纳入未标记数据,但其性能对伪标签质量和噪声预测高度敏感。为了解决这些挑战,我们提出了一个双流知识蒸馏框架(DKD),专门为SSR设计,用于蒸馏连续值知识和分布信息。这种设计更好地保留了回归幅度信息并提高了样本效率。具体来说,在DKD中,教师模型仅使用真实标签进行优化以进行标签分布估计,而学生模型则从真实标签和教师生成的未标记数据伪目标中学习。蒸馏过程实现了有效的监督转移,使学生能够更鲁棒地利用伪标签。此外,我们引入了一个解耦分布对齐(DDA)模块,该模块分别对齐教师和学生之间的目标分布和非目标分布。为了提高非目标知识转移的可靠性,DDA包含一个方差引导的非目标分布对齐策略,该策略自适应地降低不确定的教师预测的权重,从而增强学生减轻伪标签监督中噪声的能力,并学习一个更好校准的回归预测器。

英文摘要

Semi-supervised regression (SSR), which aims to predict continuous scores for samples while reducing the reliance on large-scale labeled data, has recently attracted considerable attention across various applications, including computer vision, natural language processing, audio analysis, and medical analysis. Existing SSR methods typically train models with scarce labeled data by introducing constraint-based regularization or ordinal ranking to mitigate overfitting. However, these approaches often fail to fully exploit the abundance of unlabeled samples. Although consistency-driven pseudo-labeling methods attempt to incorporate unlabeled data, their performance is highly sensitive to pseudo-label quality and noisy predictions. To address these challenges, we propose a Dual-stream Knowledge Distillation framework (DKD), which is specifically designed for SSR to distill both continuous-valued knowledge and distributional information. This design better preserves regression magnitude information and improves sample efficiency. Specifically, in DKD, the teacher is optimized solely with ground-truth labels for label distribution estimation, while the student learns from a mixture of real labels and teacher-generated pseudo targets on unlabeled data. The distillation process enables effective supervision transfer, allowing the student to leverage pseudo labels more robustly. Furthermore, we introduce a Decoupled Distribution Alignment (DDA) module, which separately aligns the target and non-target distributions between the teacher and student. To improve the reliability of non-target knowledge transfer, DDA incorporates a variance-guided non-target distribution alignment strategy that adaptively downweights uncertain teacher predictions, thereby enhancing the student's ability to mitigate noise in pseudo-label supervision and learn a better-calibrated regression predictor.
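The idea of adaptively downweighting uncertain teacher predictions can be illustrated with a small sketch. Everything concrete here is an assumption: the weighting rule `1 / (1 + variance)`, the function names, and the toy numbers are illustrative only, and the abstract does not specify how DDA actually computes its weights.

```python
def variance_weights(variances):
    """Map teacher predictive variances to weights in (0, 1].

    Assumed rule: weight = 1 / (1 + variance), so a confident teacher
    prediction (variance 0) gets weight 1 and noisier ones get less.
    """
    return [1.0 / (1.0 + v) for v in variances]


def weighted_alignment_loss(per_sample_losses, variances):
    """Weighted mean of per-sample alignment losses on unlabeled data."""
    weights = variance_weights(variances)
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / sum(weights)


if __name__ == "__main__":
    losses = [1.0, 3.0]      # hypothetical student-teacher alignment losses
    variances = [0.0, 1.0]   # hypothetical teacher uncertainty per sample
    # The second sample is uncertain, so it contributes less than it
    # would in a plain average of the two losses.
    print(weighted_alignment_loss(losses, variances))
```

Normalizing by the sum of weights keeps the loss on the same scale as an unweighted mean, so only the relative influence of uncertain samples changes.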

2601.15015 2026-05-28 cs.LG

Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control

强化学习算法在大规模流动控制中的即插即用基准测试

Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz

发表机构 * TU Dortmund University(多特蒙德工业大学) Lamarr Institute for Machine Learning(拉马尔机器学习研究所) Technical University Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 提出首个完全基于PyTorch、可微分的强化学习流动控制基准套件FluidGym,通过标准化评估协议实现控制方法的系统比较。

Comments Accepted to ICML 2026. Code available at https://github.com/safe-autonomous-systems/fluidgym

详情
AI中文摘要

强化学习(RL)在主动流动控制(AFC)中显示出有希望的结果,但由于现有研究依赖于异构的观测和驱动方案、数值设置和评估协议,该领域的进展仍然难以评估。当前的AFC基准试图解决这些问题,但严重依赖外部计算流体动力学(CFD)求解器,不是完全可微分的,并且对3D和多智能体的支持有限。为了克服这些限制,我们引入了FluidGym,这是第一个独立的、完全可微分的AFC中RL基准套件。FluidGym完全在PyTorch中构建,基于GPU加速的PICT求解器,在单个Python堆栈中运行,不需要外部CFD软件,并提供标准化的评估协议。我们展示了使用PPO、SAC、DPC和TD-MPC的基线结果,并将所有环境、数据集和训练模型作为公共资源发布。FluidGym能够系统比较控制方法,为基于学习的流动控制的未来研究建立可扩展的基础,并可在github.com/safe-autonomous-systems/fluidgym获取。

英文摘要

Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi-agent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU-accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO, SAC, DPC, and TD-MPC, and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning-based flow control, and is available at github.com/safe-autonomous-systems/fluidgym.

2505.17654 2026-05-28 cs.CL cs.AI

EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

EVADE-Bench:用于评估和增强规避性内容检测的多模态基准

Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Yukun Chen, Hamid Alinejad-Rokny, Min Yang

发表机构 * SIAT, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) University of Chinese Academy of Sciences(中国科学院大学) Alibaba Group(阿里巴巴集团) University of New South Wales(新南威尔士大学)

AI总结 针对电商平台中LLM/VLM易受规避性内容攻击的问题,提出首个专家标注的中文多模态基准EVADE-Bench,评估26个模型并发现规则分类可提升检测一致性,多智能体分解策略能显著提高准确率。

Comments SIGIR 2026

详情
AI中文摘要

电商平台越来越依赖大型语言模型(LLMs)和视觉语言模型(VLMs)来检测非法或误导性产品内容。然而,这些模型仍然容易受到规避性内容的影响,即通过分词、委婉语言或图像裁剪等技术故意修改的输入,以掩盖违反政策的行为,同时仍传达被禁止的主张。关键在于,检测此类内容需要模型同时掌握两种能力:准确理解复杂规则,以及正确推断故意混淆的多模态输入背后的真实意图。虽然先前的工作分别探索了LLM对复杂规则的推理和基于LLM的规避性内容检测,但现有基准尚未将两者结合在统一的评估框架内。这一差距在电商领域尤为严重,因为准确的审核要求这两种能力协同运作。为填补这一空白,我们引入了EVADE-Bench,这是首个专家策划的中文多模态基准,专门设计用于评估LLMs和VLMs在真实电商场景中的规避性内容检测。我们对26个开源和闭源LLMs及VLMs的全面评估显示,即使是最先进的模型也经常错误分类规避性样本。我们进一步证明,更清晰的规则分类显著提高了模型预测的一致性并减少了错误预测,凸显了基准设计在实现可靠评估中的关键作用。为了探索性能提升的路径,我们研究了多智能体分解在多模态推理中的可行性,即将视觉描述和逻辑推理解耦为独立的智能体,并发现这一策略带来了显著的准确率提升。

英文摘要

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content, which refers to inputs that have been deliberately modified through techniques such as word splitting, euphemistic language, or image cropping to conceal policy violations while still conveying prohibited claims. Crucially, detecting such content requires a model to simultaneously master two capabilities: accurately comprehending complex rules, and correctly inferring the true intent behind deliberately obfuscated multimodal inputs. While prior work has separately explored LLM reasoning over complex rules and LLM-based detection of evasive content, no existing benchmark combines both within a unified evaluation framework. This gap is particularly consequential in e-commerce, where accurate moderation demands that both capabilities operate in concert. To address this gap, we introduce EVADE-Bench, the first expert-curated Chinese multimodal benchmark specifically designed to evaluate LLMs and VLMs on evasive content detection in real-world e-commerce scenarios. Our comprehensive evaluation of 26 open- and closed-source LLMs and VLMs reveals that even state-of-the-art models frequently misclassify evasive samples. We further demonstrate that clearer rule categorization significantly improves model prediction consistency and reduces false predictions, highlighting the critical role of benchmark design in enabling reliable evaluation. To explore paths for performance improvement, we investigate the feasibility of multi-agent decomposition for multimodal reasoning, wherein visual description and logical inference are decoupled into separate agents, and find that this strategy yields notable accuracy gains.

2601.12154 2026-05-28 cs.CL

Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

基于嵌入的主题建模和LLM分析癌症患者的体验

Teodor-Călin Ionescu, Lifeng Han, Jan Heijdra Suasnabar, Anne Stiggelbout, Suzan Verberne

发表机构 * Leiden Institute of Advanced Computer Science (LIACS), Leiden, The Netherlands(莱顿高级计算机科学研究所(LIACS),莱顿,荷兰) Leiden University Medical Center (LUMC), Leiden, The Netherlands(莱顿大学医学中心(LUMC),莱顿,荷兰)

AI总结 本研究利用BERTopic和Top2Vec等神经主题建模方法,结合LLM(GPT4)进行主题标注,从癌症患者访谈数据中提取有意义主题,并评估不同嵌入模型的效果,发现领域特定的BioClinicalBERT嵌入能提高主题精度和可解释性。

Comments accepted by the CLIN journal. The CLIN Journal is the journal for research in computational linguistics in The Netherlands and Belgium

详情
AI中文摘要

本研究探讨了使用神经主题建模和LLM从患者叙述数据中发现有意义主题的方法,以提供有助于更以患者为中心的医疗实践的见解。我们分析了一组转录的癌症患者访谈(13次访谈,共132,722词)。首先,我们通过使用相似的预处理、分块和聚类配置,评估BERTopic和Top2Vec在单个访谈摘要中的关键词提取性能,以确保公平比较。然后,使用LLM(GPT4)进行下一步的主题标注。通过小规模人工评估,对单个访谈(I0)的输出进行评分,重点关注连贯性、清晰度和相关性。基于初步结果和评估,BERTopic表现出更强的性能,并被选用于进一步实验,使用三种临床导向的嵌入模型。然后,我们使用最佳模型设置分析了完整的访谈集合。结果表明,领域特定的嵌入提高了主题的精确度和可解释性,其中BioClinicalBERT在转录中产生最一致的结果。使用BioClinicalBERT嵌入模型对全部13次访谈的全局分析揭示了所有13次访谈中最主要的主题,即“癌症护理管理中的协调与沟通”和“患者癌症治疗旅程中的决策”。尽管这些访谈是从荷兰语机器翻译成英语,且临床专业人员未参与评估,但研究结果表明,神经主题建模,特别是BERTopic,可以帮助从患者访谈中为临床医生提供有用的反馈。该流程可以支持更高效的文档导航,并加强患者在医疗工作流程中的声音。

英文摘要

This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, to offer insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization by using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on keyword extraction. LLMs (GPT4) are then used for the next step, topic labeling. Their outputs for a single interview (I0) are rated through a small-scale human evaluation, focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three clinically oriented embedding models. We then analyze the full interview collection with the best model setting. Results show that domain-specific embeddings improved topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.

2601.10714 2026-05-28 cs.CV cs.GR

Alterbute: Editing Intrinsic Attributes of Objects in Images

Alterbute: 编辑图像中物体的内在属性

Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen

发表机构 * Google(谷歌) The Hebrew University of Jerusalem(耶路撒冷希伯来大学) Reichman University(雷赫曼大学)

AI总结 提出Alterbute方法,通过扩散模型结合松弛训练目标和视觉命名实体,在保持物体身份和场景上下文的同时编辑颜色、纹理、材质和形状等内在属性。

Comments ICML 2026. Project page is available at https://talreiss.github.io/alterbute/

详情
AI中文摘要

我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中物体的内在属性。我们允许改变物体的颜色、纹理、材质甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖无监督先验,往往无法保持身份,要么使用过度严格的监督,阻止有意义的内部变化。我们的方法依赖于:(i) 一个松弛的训练目标,允许模型在身份参考图像、描述目标内在属性的文本提示以及定义外在上下文的背景图像和物体掩码的条件下,改变内在和外在属性。在推理时,我们通过重用原始背景和物体掩码来限制外在变化,从而确保只改变所需的内在属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如“保时捷911 Carrera”),这些类别将共享身份定义特征的物体分组,同时允许内在属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和内在属性描述,从而实现可扩展的、保持身份的监督。Alterbute在保持身份的物体内在属性编辑方面优于现有方法。

英文摘要

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

2601.10334 2026-05-28 cs.CV cs.LG

An analytic theory of convolutional neural network inverse problems solvers

Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss

Affiliations * IRIT & CBI, CNRS & Université Toulouse, France; Toulouse School of Economics, Université Toulouse Capitole, France

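AI summary Analyzes trained CNNs through the minimum mean square error (MMSE) estimator, adding the inductive biases of translation equivariance and finite receptive fields; derives an analytic formula for the Local-Equivariant MMSE and verifies across multiple inverse problems, datasets, and architectures that it closely matches neural network outputs.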

Journal ref Forty-Third International Conference on Machine Learning, 2026

Details
Abstract

Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim 25$ dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
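
For readers unfamiliar with MMSE estimation under an empirical training distribution, the following minimal sketch may help. It implements the standard unconstrained case for Gaussian denoising, where the posterior mean is a softmax-weighted average of the training samples. It is not the paper's LE-MMSE formula, which adds equivariance and locality constraints that the abstract does not spell out. The function names, the toy data, and the noise level are all hypothetical.

```python
import math


def empirical_mmse_denoise(y, train, sigma):
    """Posterior mean E[x | y] when x is uniform over `train`
    and y = x + Gaussian noise with standard deviation `sigma`.
    (Textbook unconstrained case, assumed here for illustration.)"""
    # Log-weight of each training point: -||y - x_i||^2 / (2 sigma^2).
    logw = [
        -sum((yj - xj) ** 2 for yj, xj in zip(y, x)) / (2.0 * sigma ** 2)
        for x in train
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    total = sum(w)
    # Weighted average of the training points, coordinate by coordinate.
    return [
        sum(wi * x[k] for wi, x in zip(w, train)) / total
        for k in range(len(y))
    ]


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy two-sample "training set" (hypothetical data).
train = [[0.0, 0.0], [1.0, 1.0]]
x_hat = empirical_mmse_denoise([0.9, 1.1], train, sigma=0.1)
```

With a small `sigma`, the estimate collapses onto the nearest training sample; with a large `sigma`, it drifts toward the mean of the training set. A PSNR such as the one computed by `psnr` is the kind of agreement measure the abstract reports between theory and network outputs.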

2601.10085 2026-05-28 cs.CL

CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Michael L. Birnbaum, Munmun De Choudhury

Affiliations * Georgia Institute of Technology; Northwell Health; Columbia University

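AI summary Proposes CALM-IT, a framework that generates and evaluates long-form Motivational Interviewing dialogues by explicitly modeling the evolving states of client and counselor; on a corpus of 8,232 synthetic dialogues it outperforms the baselines, performing best on most MITI 4.2 global ratings and on client acceptance rate.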

Comments 53 pages, in submission to EMNLP

Details
Abstract

Therapeutic dialogue is not a sequence of isolated responses: client goals, motivation, resistance, and therapeutic alliance evolve over time. Yet current LLM-based mental health dialogue systems often lack explicit mechanisms for tracking these dynamics across extended interactions, which can lead to poorly timed interventions or premature goal resolution. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing dialogues through explicit modeling of evolving client and counselor states, guiding both counseling strategy selection and utterance generation. We evaluate CALM-IT on a large-scale corpus of 8,232 synthetic dialogues spanning multiple dialogue lengths and frameworks. Compared with all baselines, CALM-IT achieves the best performance on most MITI 4.2 global ratings, including Empathy, Partnership, and Softening Sustain Talk, as well as on other key performance metrics while exhibiting minimal performance degradation as dialogue length increases. Notably, although CALM-IT initiates fewer change-directed prompts, it produces the highest client acceptance rate (64.3%) on average across different length conditions. We release a reproducible generation framework, a MITI-grounded process-level evaluation protocol, and a large-scale synthetic corpus for studying therapeutic LLMs under realistic long-form interaction conditions.