arXivDaily arXiv Daily Academic Digest, updated Monday to Friday
2601.22450 2026-06-04 cs.LG cs.AI

Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity

Jianhao Huang, Baharan Mirzasoleiman

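Affiliations University of California, Berkeley; Stanford University

AI summary This paper studies the generalization properties of masked diffusion language models through the $k$-parity problem. It theoretically decomposes the objective into signal and noise components, treats the noise component as an implicit regularizer, and substantially improves model performance by optimizing the mask-probability distribution.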

Comments ICML 2026

Abstract

Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In this work, we investigate these properties within the setting of the $k$-parity problem (computing the XOR sum of $k$ relevant bits), where neural networks typically exhibit grokking -- a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime which drives feature learning, and a Noise regime which serves as an implicit regularizer. By training nanoGPT using the MD objective on the $k$-parity problem, we demonstrate that the MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without experiencing grokking. Furthermore, we leverage our theoretical insights to optimize the distribution of the mask probability in the MD objective. Our method significantly improves perplexity for 50M-parameter models and achieves superior results across both pre-training from scratch and supervised fine-tuning. Specifically, we observe performance gains peaking at $8.8\%$ and $5.8\%$, respectively, on 8B-parameter models, confirming the scalability and effectiveness of our framework in large-scale masked diffusion language model regimes.
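
To make the setup in this abstract concrete, the following is a minimal sketch of how a $k$-parity sequence and a masked-diffusion training input could be built. It is not the authors' code: the `MASK` token id, the helper names, and the uniform sampling of the mask probability are assumptions made only for illustration (the paper itself tunes the mask-probability distribution).

```python
import random

MASK = 2  # hypothetical mask token id; the data bits use ids 0 and 1


def parity_example(n_bits, relevant, rng):
    """Sample n_bits random bits and append the XOR of the relevant ones."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    label = 0
    for i in relevant:
        label ^= bits[i]
    return bits + [label]


def mask_tokens(tokens, mask_prob, rng):
    """Independently replace each token with MASK with probability mask_prob."""
    noisy, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            noisy.append(MASK)
            targets.append(tok)   # a loss would be computed on masked positions
        else:
            noisy.append(tok)
            targets.append(None)  # unmasked positions carry no target
    return noisy, targets


rng = random.Random(0)
seq = parity_example(n_bits=8, relevant=[1, 4, 6], rng=rng)
t = rng.random()  # assumption: mask probability drawn uniformly from [0, 1)
noisy, targets = mask_tokens(seq, t, rng)
print(seq, noisy)
```

Changing how `t` is sampled is the knob the abstract refers to when it speaks of optimizing the mask-probability distribution.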

2601.22396 2026-06-04 cs.CL cs.AI cs.CY cs.HC physics.soc-ph

Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

Candida M. Greco, Lucio La Cava, Andrea Tagarelli

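Affiliations DIMES, University of Calabria, Italy

AI summary Using the World Values Survey, the Inglehart-Welzel Cultural Map, and Moral Foundations Theory, this study evaluates whether culturally grounded personas generated by large language models accurately reflect world and moral value systems under different cultural conditionings, and analyzes their cross-cultural structure and moral variation.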

Comments Under Review

Abstract

Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.

2601.19921 2026-06-04 cs.CL cs.AI

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

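Affiliations University of Cambridge; University of Sheffield

AI summary To address the weak gains of multi-agent debate (MAD) in improving large language model performance, the paper proposes two lightweight interventions, diversity-aware initialisation and a confidence-modulated debate protocol, which substantially improve debate effectiveness.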

Abstract

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
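
The contrast between plain majority voting and confidence-aware aggregation can be illustrated with a small sketch. This is not the paper's debate protocol: the function names and the simple "sum of stated confidences" rule are assumptions chosen only to show why calibrated confidence can change the outcome.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent answer, ignoring confidence."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Return the answer with the largest total stated confidence."""
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Two unsure agents agree on "A"; one confident agent says "B".
answers = ["A", "A", "B"]
confidences = [0.2, 0.3, 0.9]

print(majority_vote(answers))
print(confidence_weighted_vote(answers, confidences))
```

The weighted rule only helps if the stated confidences are calibrated, which is why the abstract stresses calibrated confidence communication.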

2512.03553 2026-06-04 cs.CV cs.AI

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan

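Affiliations TikTok, Singapore; TikTok, San Jose, United States; TikTok, Shanghai, China

AI summary Proposes a hybrid moderation framework that combines supervised classification with reference-based similarity matching and uses a multimodal large language model to improve accuracy, enabling large-scale livestream moderation while keeping inference lightweight.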

Comments To be published at KDD 2026 (ADS track)

Abstract

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
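
The "recall at 80% precision" figures above refer to a standard operating-point metric. As a hedged illustration, the sketch below computes the best recall achievable at any score threshold whose precision meets a target; the function name and the toy scores are assumptions, not data from the paper.

```python
def recall_at_precision(scores, labels, min_precision):
    """Best recall over score thresholds whose precision is >= min_precision.

    labels are 1 for violating items and 0 otherwise; at least one label
    must be 1.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Tied scores share one threshold, so evaluate only after the tie ends.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp / (tp + fp) >= min_precision:
            best = max(best, tp / total_pos)
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(recall_at_precision(scores, labels, 0.80))
print(recall_at_precision(scores, labels, 0.75))
```

In the toy data, tightening the precision target from 0.75 to 0.80 forces a higher threshold and therefore a lower recall, which is the trade-off the reported operating points summarize.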

2506.10912 2026-06-04 cs.AI cs.CL

Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, Gen Luo, Fei-Yue Wang

Affiliations Department of Engineering Science, Macau University of Science and Technology, Macau, China; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Pharmacy, Macau University of Science and Technology, Macau, China; Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, China; State Key Laboratory of Biopharmaceutical Preparation and Delivery, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, China; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China

AI summary This paper introduces the ToxiMol benchmark task, which uses multimodal large language models for molecular toxicity repair, and builds a dataset, a prompting pipeline, and the automated evaluation framework ToxiEval. Experiments show that current models still face significant challenges but begin to show potential in toxicity understanding and structure editing.

Abstract

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.

2601.19683 2026-06-04 cs.CV

SharpNet: Enhancing MLPs to Represent Functions with Controlled Non-differentiability

Hanting Niu, Junkai Deng, Fei Hou, Wencheng Wang, Ying He

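Affiliations Key Laboratory of System Software (CAS), Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

AI summary Proposes the SharpNet architecture, which introduces an auxiliary feature function based on Poisson's equation with jump Neumann boundary conditions so that an MLP can precisely control where non-differentiability occurs, accurately recovering sharp features while preserving smoothness elsewhere.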

Abstract

Multi-layer perceptrons (MLPs) are a standard tool for learning and function approximation, but they inherently produce globally smooth outputs. Consequently, they struggle to represent functions that are continuous yet intentionally non-differentiable (i.e., functions with prescribed $C^0$ sharp features) without ad hoc post-processing. We present SharpNet, a modified MLP architecture that encodes user-specified sharp features by augmenting the network with an auxiliary feature function defined as the solution to Poisson's equation with jump Neumann boundary conditions. This feature function is evaluated via an efficient local integral and is fully differentiable with respect to the feature locations, allowing us to jointly optimize both the feature locations and the MLP parameters to recover the target function or geometry. This construction provides precise control over where non-differentiability occurs, enforcing the desired $C^0$ behavior at feature locations while preserving smoothness elsewhere. We validate SharpNet on 2D problems and 3D CAD reconstruction, and compare it with several state-of-the-art baselines. In both settings, SharpNet accurately recovers sharp edges and corners while remaining smooth away from them, whereas existing methods tend to blur gradient discontinuities. Qualitative and quantitative results demonstrate the effectiveness of our approach. Our project page, code and models are publicly available at https://sharpnettech.github.io.

2601.19449 2026-06-04 cs.LG

Fixed Aggregation Features Can Rival GNNs

Celia Rubio-Madrigal, Rebekka Burkholz

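AI summary Proposes Fixed Aggregation Features (FAFs), which turn graph learning into a tabular problem; combining training-free aggregated features with tabular models matches or exceeds GNN performance on most benchmarks.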

Comments Accepted at ICML 2026

Abstract

Graph neural networks (GNNs) are widely believed to excel at node representation learning through trainable neighborhood aggregations. We challenge this view by introducing Fixed Aggregation Features (FAFs), a training-free approach that transforms graph learning tasks into tabular problems. This simple shift enables the use of well-established tabular methods, offering strong interpretability and the flexibility to deploy diverse classifiers. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks -- often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. To explain the theoretical possibility of non-trainable aggregations, we connect our findings to Kolmogorov-Arnold representations and discuss when mean aggregation can be sufficient. In conclusion, our results call for (i) richer benchmarks benefiting from learning diverse neighborhood aggregations, (ii) strong tabular baselines as standard, and (iii) employing and advancing tabular models for graph data to gain new insights into related tasks.
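
The idea of turning a graph task into a tabular one through training-free aggregation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function names, the adjacency-list input, and the choice to concatenate several hops of mean aggregation are all hypothetical.

```python
def mean_aggregate(features, neighbors):
    """For each node, return the mean of its neighbors' feature vectors."""
    dim = len(features[0])
    out = []
    for node in range(len(features)):
        nbrs = neighbors[node]
        if not nbrs:
            out.append([0.0] * dim)  # isolated node: zeros (an assumption)
            continue
        out.append(
            [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        )
    return out


def fixed_aggregation_features(features, neighbors, hops=2):
    """Concatenate raw features with repeated mean aggregations (no training)."""
    table = [list(row) for row in features]
    current = features
    for _ in range(hops):
        current = mean_aggregate(current, neighbors)
        for row, agg in zip(table, current):
            row.extend(agg)
    return table


# Path graph 0 - 1 - 2 with one scalar feature per node.
features = [[1.0], [2.0], [3.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
table = fixed_aggregation_features(features, neighbors, hops=2)
print(table)
```

Each row of `table` is an ordinary feature vector, so any tabular classifier (for example a tuned MLP, as in the abstract) could be trained on it without message passing.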

2601.18777 2026-06-04 cs.LG cs.AI cs.CL cs.IR stat.AP

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

Abhishek Divekar, Anirban Majumder

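Author note Primary contributor and corresponding author

AI summary Proposes the PRECISE framework, which combines a small number of human annotations with LLM judgments via Prediction-Powered Inference (PPI) to reliably estimate metrics for search, ranking, and RAG systems in low-resource settings while correcting for LLM bias.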

Comments Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)

Abstract

Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from $O(2^{|C|})$ to $O(2^K)$, where $|C|$ represents corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
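
For readers unfamiliar with Prediction-Powered Inference, the sketch below shows the basic PPI point estimate of a mean: the average LLM judgment on the large unlabeled set, corrected by the average human-minus-LLM gap on the small labeled set. It illustrates only this textbook estimator, not the PRECISE extension to sub-instance annotations; the function name and toy numbers are assumptions.

```python
def ppi_mean(labeled_truth, labeled_pred, unlabeled_pred):
    """Prediction-powered estimate of a mean.

    labeled_truth:  human labels on the small annotated set
    labeled_pred:   LLM judgments on the same annotated items
    unlabeled_pred: LLM judgments on the large unannotated set
    """
    model_mean = sum(unlabeled_pred) / len(unlabeled_pred)
    # The rectifier measures how far the LLM is off, on average, where we can check.
    rectifier = sum(y - f for y, f in zip(labeled_truth, labeled_pred)) / len(
        labeled_truth
    )
    return model_mean + rectifier


# Toy relevance judgments: the LLM judge over-predicts relevance.
human = [1, 0, 1, 1]
llm_on_labeled = [1, 1, 1, 1]
llm_on_unlabeled = [1, 1, 0, 1, 1, 1, 0, 1]

estimate = ppi_mean(human, llm_on_labeled, llm_on_unlabeled)
print(estimate)
```

Here the raw LLM average (0.75) is pulled down by the measured bias (-0.25), which is the bias-correction behaviour the abstract relies on.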

2601.18175 2026-06-04 cs.AI cs.LG cs.SY eess.SY stat.ML

Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

Daniel Russo

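AI summary This paper proves that success conditioning (imitating successful trajectories) exactly solves a trust-region optimization problem whose $\chi^2$ divergence constraint radius is determined automatically by the data, and reveals an equality among relative policy improvement, the magnitude of policy change, and action-influence.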

Abstract

A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names -- rejection sampling with SFT, goal-conditioned RL, Decision Transformers -- yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a $\chi^2$ divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence -- measuring how random variation in action choices affects success rates -- are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing this can amplify improvement, but at the cost of potential misalignment with the true objective.
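
The identity described above can be checked by hand in the simplest possible case. The sketch below assumes a single-state problem (a bandit) with known per-action success probabilities, which is a simplification introduced here and not the paper's general setting. Conditioning on success reweights each action by its success probability, and the relative improvement then equals the $\chi^2$ divergence between the new and old policies.

```python
def success_condition(policy, success_prob):
    """Reweight each action by its success probability and renormalize."""
    value = sum(policy[a] * success_prob[a] for a in policy)
    new_policy = {a: policy[a] * success_prob[a] / value for a in policy}
    return new_policy, value


def chi_square(new, old):
    """Chi-square divergence of the new policy from the old one."""
    return sum((new[a] - old[a]) ** 2 / old[a] for a in old)


policy = {"left": 0.5, "right": 0.5}
success_prob = {"left": 0.2, "right": 0.6}  # hypothetical success rates

new_policy, value = success_condition(policy, success_prob)
new_value = sum(new_policy[a] * success_prob[a] for a in new_policy)

relative_improvement = (new_value - value) / value
divergence = chi_square(new_policy, policy)
print(relative_improvement, divergence)
```

In this toy case both quantities equal the variance of the success probability under the old policy divided by the squared success rate, so a policy whose actions barely differ in success rate is barely changed, matching the "fails observably" remark in the abstract.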

2601.17469 2026-06-04 cs.LG

Identifying and Correcting Label Noise for Robust GNNs via Influence Contradiction

Wei Ju, Wei Zhang, Siyu Yi, Zhengyang Mao, Yifan Wang, Jingyang Yuan, Zhiping Xiao, Ziyue Qiao, Ming Zhang

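Affiliations University of Science and Technology of China

AI summary Proposes ICGNN, which computes an influence contradiction score (ICS) from the graph diffusion matrix to detect noisy labels, corrects them with a soft strategy based on neighbor predictions, and adds pseudo-labeling to improve robustness.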

Comments Accepted by Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Graph Neural Networks (GNNs) have shown remarkable capabilities in learning from graph-structured data with various applications such as social analysis and bioinformatics. However, the presence of label noise in real scenarios poses a significant challenge in learning robust GNNs, and their effectiveness can be severely impacted when dealing with noisy labels on graphs, often stemming from annotation errors or inconsistencies. To address this, in this paper we propose a novel approach called ICGNN that harnesses the structure information of the graph to effectively alleviate the challenges posed by noisy labels. Specifically, we first design a novel noise indicator that measures the influence contradiction score (ICS) based on the graph diffusion matrix to quantify the credibility of nodes with clean labels, such that nodes with higher ICS values are more likely to be detected as having noisy labels. Then we leverage the Gaussian mixture model to precisely detect whether the label of a node is noisy or not. Additionally, we develop a soft strategy to combine the predictions from neighboring nodes on the graph to correct the detected noisy labels. Finally, pseudo-labeling for abundant unlabeled nodes is incorporated to provide auxiliary supervision signals and guide the model optimization. Experiments on benchmark datasets show the superiority of our approach over competitive baselines in noisy label scenarios.

2601.17363 2026-06-04 cs.CL cs.AI

Do readers prefer AI-generated Italian short stories?

Michael Farrell

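Affiliations IULM University, Milan, Italy

AI summary In a blind experiment comparing Italian short stories generated by AI (ChatGPT-4o) with one by the renowned author Alberto Moravia, the AI texts received slightly higher average ratings and were preferred more often, though the differences were modest and showed no significant association with demographics or reading habits.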

Comments 8 pages, peer-reviewed and accepted for presentation at New Trends in Translation and Interpreting Technology (NeTTIT 2026), paged-up for publication

Abstract

This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a blind setup, 20 participants read and evaluated three stories, two created with ChatGPT-4o and one by Alberto Moravia, without being informed of their origin. To explore potential influencing factors, reading habits and demographic data, comprising age, gender, education and first language, were also collected. The results showed that the AI-written texts received slightly higher average ratings and were more frequently preferred, although differences were modest. No statistically significant associations were found between text preference and demographic or reading-habit variables. These findings challenge assumptions about reader preference for human-authored fiction and raise questions about the necessity of synthetic-text editing in literary contexts.

2601.06196 2026-06-04 cs.LG cs.AI cs.CL

Geometry-Aware Hallucination Detection in Large Language Models

Bodla Krishna Vamshi, Rohan Bhatnagar, Haizhao Yang

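Affiliations University of Maryland, College Park

AI summary Proposes the GA-ICL framework, which uses latent representations from frozen LLMs to model local manifold structure and class-prototype geometry when selecting in-context demonstrations for hallucination detection, outperforming baseline selection methods in most evaluated settings on the FEVER and HaluEval benchmarks.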

Abstract

Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. Prior work has explored decoding strategies, retrieval augmentation, and supervised fine-tuning for hallucination detection, while recent studies show that in-context learning (ICL) can substantially influence factual reliability. However, existing ICL demonstration selection methods often rely on surface-level similarity heuristics and exhibit limited robustness across tasks and models. We propose GA-ICL, a geometry-aware demonstration sampling framework for selecting in-context demonstrations that leverages latent representations extracted from frozen LLMs. By jointly modeling local manifold structure and class-aware prototype geometry, GA-ICL selects demonstrations based on their proximity to learned prototypes rather than lexical or embedding similarity alone. Across factual verification (FEVER) and hallucination detection (HaluEval) benchmarks, GA-ICL outperforms standard ICL selection baselines in the majority of evaluated settings, with particularly strong gains on dialogue and summarization tasks. The method remains robust under temperature perturbations and model variation, indicating improved stability compared to heuristic retrieval strategies. While lexical retrieval can remain competitive in certain question-answering regimes at smaller model scales, our results demonstrate that geometry-aware prototype selection provides a reliable and training-light approach for hallucination detection without modifying LLM parameters. Extended evaluations on Phi-14B and Qwen3-32B confirm that GA-ICL scales effectively to larger models, outperforming all compared baselines including on QA tasks where smaller models show boundary-condition limitations, offering a principled direction for improved ICL demonstration selection.
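
As a rough illustration of prototype-based demonstration selection, the sketch below picks, for each class, the pool examples whose embeddings lie closest to that class's prototype. It makes strong simplifying assumptions that are not from the paper: prototypes are plain class centroids rather than learned ones, distances are Euclidean, local manifold structure is ignored, and the two-dimensional embeddings are made up.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def select_demonstrations(pool, per_class=1):
    """pool holds (embedding, label, text) triples.

    Returns, for every label, the per_class examples nearest to that
    label's centroid (a stand-in for a learned prototype).
    """
    by_label = {}
    for item in pool:
        by_label.setdefault(item[1], []).append(item)
    chosen = []
    for label, items in by_label.items():
        proto = centroid([emb for emb, _, _ in items])
        ranked = sorted(items, key=lambda it: math.dist(it[0], proto))
        chosen.extend(ranked[:per_class])
    return chosen


pool = [
    ((0.0, 0.0), "supported", "claim s1"),
    ((1.0, 0.0), "supported", "claim s2"),
    ((5.0, 5.0), "supported", "claim s3"),
    ((10.0, 10.0), "hallucinated", "claim h1"),
    ((11.0, 10.0), "hallucinated", "claim h2"),
    ((12.0, 10.0), "hallucinated", "claim h3"),
]
demos = select_demonstrations(pool, per_class=1)
print([text for _, _, text in demos])
```

In practice the embeddings would come from a frozen LLM's hidden states, as the abstract describes; the selection step itself needs no parameter updates.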

2601.13735 2026-06-04 cs.AI

Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection

Hojin Kim, Jaehyung Kim

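Affiliations Yonsei University

AI summary Through three classes of causal perturbation experiments, this paper finds that current probabilistic confidence metrics mainly capture surface-level fluency rather than reasoning quality, and proposes a contrastive causality metric for more faithful output selection.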

Comments 15 pages, 4 figures

Abstract

Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption that higher confidence reflects higher reasoning fidelity. In this work, we challenge this assumption by investigating whether these metrics truly capture inter-step causal dependencies necessary for valid reasoning. We introduce three classes of inter-step causality perturbations that systematically disrupt dependencies between reasoning steps while preserving local fluency. Surprisingly, across diverse model families and reasoning benchmarks, we find that selection accuracy degrades only marginally under these disruptions. Even severe interventions, such as applying hard attention masks that directly prevent the model from attending to prior reasoning steps, do not substantially reduce selection performance. These findings provide strong evidence that current probabilistic metrics are largely insensitive to logical structure, and primarily capture surface-level fluency or in-distribution priors instead. Motivated by this gap, we propose a contrastive causality metric that explicitly isolates inter-step causal dependencies, and demonstrate that it yields more faithful output selection than existing probability-based approaches.
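
For context, a typical probability-based Best-of-N selector scores each sampled response by a statistic of its token log-probabilities and keeps the highest-scoring one. The sketch below uses the mean token log-probability, which is one common choice and an assumption here; the paper may study other confidence metrics, and the candidate texts and numbers are invented.

```python
def mean_logprob(token_logprobs):
    """Length-normalized log-likelihood of one sampled response."""
    return sum(token_logprobs) / len(token_logprobs)


def best_of_n(candidates):
    """candidates holds (text, token_logprobs) pairs; return the top-scoring text."""
    return max(candidates, key=lambda c: mean_logprob(c[1]))[0]


candidates = [
    ("answer: 12", [-0.1, -0.2, -0.3]),
    ("answer: 15", [-0.05, -0.1, -0.05]),
    ("answer: 9", [-1.0, -0.5]),
]
print(best_of_n(candidates))
```

Nothing in this score looks at whether one reasoning step follows from the previous one, which is consistent with the abstract's finding that such metrics are largely insensitive to logical structure.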

2412.18134 2026-06-04 cs.LG cs.CC cs.PL cs.SE

Learning Randomized Reductions

Ferhat Erata, Orr Paradise, Thanos Typaldos, Timos Antonopoulos, ThanhVu Nguyen, Shafi Goldwasser, Ruzica Piskac

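Affiliations Yale University, USA; EPFL, Switzerland; George Mason University, USA

AI summary Proposes the Bitween framework for automatically learning randomized self-reductions (RSRs). Vanilla Bitween, with backends such as linear regression and genetic programming, discovers RSRs for 54% of 80 functions, including the first known reduction for sigmoid; Agentic Bitween, in which LLM agents propose new query functions, reaches 80%.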

Comments Accepted at ICML 2026 (Spotlight). 9 pages main text + appendix

Journal ref Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

Abstract

Randomized self-reductions (RSRs) express $f(x)$ using $f$ evaluated at random correlated points, enabling self-correcting programs, instance-hiding protocols, and applications in complexity theory and cryptography. Yet discovering RSRs has required manual expert derivation for over 40 years, limiting their practical use. We present Bitween for automated RSR learning. First, we formalize RSR learning with sample complexity analysis under correlated sampling. Second, we develop Vanilla Bitween, which integrates multiple backends (linear regression, genetic programming, symbolic regression, and mixed-integer programming). The linear regression backend outperforms the others, discovering RSRs for 43 of 80 functions (54%) in RSR-Bench, our benchmark suite, including the first known reduction for sigmoid. Third, we introduce Agentic Bitween, a neuro-symbolic approach where LLM agents propose novel query functions beyond the fixed set ($x+r$, $x-r$, $x \cdot r$, $x$, $r$) in prior work. Agentic Bitween discovers RSRs for 64 of 80 functions (80%), outperforming pure neural baselines in both RSR discovery and verification accuracy.
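
A classic textbook instance of a randomized self-reduction is a linear function over the integers modulo a prime: $f(x) = f(x + r) - f(r)$ for any random $r$, so $f(x)$ can be recovered from evaluations at two points that individually look random. The sketch below checks this identity; the modulus, the coefficient, and the function names are illustrative choices and are not taken from RSR-Bench.

```python
import random

P = 101  # small prime modulus, chosen only for illustration


def f(x):
    """A linear function over the integers mod P."""
    return (7 * x) % P


def f_via_rsr(x, oracle, rng):
    """Recover oracle(x) using only queries at x + r and r for a random r."""
    r = rng.randrange(P)
    return (oracle((x + r) % P) - oracle(r)) % P


rng = random.Random(0)
ok = all(f_via_rsr(x, f, rng) == f(x) for x in range(P))
print(ok)
```

The two queries, $x + r$ and $r$, belong to the fixed query set mentioned in the abstract; the learning problem is to find such a recombination automatically for less obvious functions.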

2601.03569 2026-06-04 cs.LG stat.AP

Local Intrinsic Dimensionality of Ground Motion Data for Early Detection of Catastrophic Slope Failure

Yuansan Liu, James Bailey, Antoinette Tordesillas

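Affiliations The University of Melbourne; Monash University

AI summary Proposes spatiotemporal Local Intrinsic Dimensionality (st-LID), an unsupervised framework that uses kinematic enhancement, Bayesian spatial fusion, and temporal modeling to improve the precision and lead time of early failure-zone detection in landslide monitoring.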

Comments 20 pages, 9 figures. ECML-PKDD 2026

Abstract

Local Intrinsic Dimensionality (LID) has shown strong potential for anomaly detection in high-dimensional data, including landslide failure detection in granular media, where early and accurate identification of failure zones is crucial for effective geohazard mitigation. However, this task is still challenging due to the spatial correlations and temporal dynamics that are inherently present in surface displacement data. To address this gap, we propose a novel unsupervised framework called spatiotemporal LID (st-LID) that generalizes the LID for robust failure detection in landslide monitoring networks. Our approach introduces three key innovations: (1) Kinematic enhancement, incorporating velocity into the LID computation to capture instantaneous deformation rates and short-term temporal dynamics; (2) Bayesian spatial fusion, which aggregates LID values across spatial neighborhoods via Bayesian estimation, to embed spatial correlations and account for localized noise; and (3) Temporal modeling (t-LID), a new variant that characterizes long-term dynamics of displacement data, providing a robust temporal representation of displacement behavior. By unifying these components, st-LID identifies complex, multi-stage failure zones often overlooked by existing methods. Extensive experiments show that st-LID consistently outperforms state-of-the-art unsupervised baselines in detection precision and lead-time, providing a robust foundation for landslide early warning systems and targeted risk intervention to enhance community resilience and preparedness strategies.

2601.07408 2026-06-04 cs.CL cs.LG

Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

基于结果锚定的优势重塑用于数学推理中的细粒度信用分配

Ziheng Li, Liu Kang, Feng Xiao, Luxi Xing, Qingyi Si, Zhuoran Li, Weikang Gong, Deqing Yang, Yanghua Xiao, Hongcheng Guo

发表机构 * Fudan University(复旦大学) XingYun lab, HUJING Digital Media & Entertainment Group(星云实验室,HUJING数字媒体与娱乐集团) University of Science and Technology Beijing(北京科技大学) Chinese Academy of Sciences(中国科学院) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出结果锚定优势重塑(OAR),通过两种策略(OAR-P和OAR-G)实现细粒度信用分配,显著提升GRPO在数学推理中的性能。

详情
AI中文摘要

组相对策略优化(GRPO)已成为一种有前途的无需评论家的强化学习范式,用于推理任务。然而,标准GRPO采用粗粒度的信用分配机制,将组级奖励均匀地传播到序列中的每个令牌,忽略了各个推理步骤的不同贡献。我们通过引入结果锚定优势重塑(OAR)来解决这一局限性,这是一种细粒度的信用分配机制,根据每个令牌对模型最终答案的影响程度重新分配优势。我们通过两种互补策略实例化OAR:(1)OAR-P,通过反事实令牌扰动估计结果敏感性,作为高保真归因信号;(2)OAR-G,使用输入梯度敏感性代理,通过单次反向传播近似影响信号。这些重要性信号与保守的双层优势重塑方案相结合,该方案抑制低影响令牌并提升关键令牌,同时保持整体优势质量。在广泛数学推理基准上的实证结果表明,虽然OAR-P设定了性能上限,但OAR-G以可忽略的计算开销实现了相当的增益,两者均显著优于强GRPO基线,推动了无需评论家的大语言模型推理的边界。

英文摘要

Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
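The abstract does not give the reshaping formula, so the following is only a rough sketch of what a two-level, mass-preserving reshaping could look like. The function name `reshape_advantages`, the mean-importance threshold, and the `boost`/`suppress` factors are all assumptions for illustration, not the authors' method.

```python
def reshape_advantages(advantage, importance, boost=1.5, suppress=0.5):
    """Spread one sequence-level advantage over tokens by importance.

    Assumed scheme (not from the paper): tokens at or above the mean
    importance are boosted, the rest are suppressed, and the weights are
    rescaled so the per-token advantages still sum to len(importance) * advantage.
    """
    n = len(importance)
    mean_importance = sum(importance) / n
    raw = [boost if s >= mean_importance else suppress for s in importance]
    scale = n / sum(raw)  # keeps the total advantage mass unchanged
    return [advantage * w * scale for w in raw]
```

With uniform GRPO credit, every token would receive `advantage`; here the same total is kept but concentrated on the tokens with higher importance scores.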

2601.07036 2026-06-04 cs.CL cs.AI cs.LG

Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Mid-Think: 通过词元级触发器实现无需训练的中间预算推理

Wang Yang, Debargha Ganguly, Xinpeng Li, Chaoda Song, Shouren Wang, Vikash Singh, Vipin Chaudhary, Xiaotian Han

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 本文通过分析注意力机制和提示实验,发现推理行为主要由少量触发词元控制,并据此提出Mid-Think方法,通过组合触发词元实现中间预算推理,在准确率-长度权衡上优于基线,并能在强化学习训练中减少时间并提升性能。

详情
AI中文摘要

混合推理语言模型通常通过高级的Think/No-think指令来控制推理行为,但我们发现这种模式切换主要由一小部分触发词元驱动,而非指令本身。通过注意力分析和受控提示实验,我们表明开头的“Okay”词元会诱导推理行为,而“</think>”后的换行模式则会抑制推理。基于这一观察,我们提出了Mid-Think,一种简单的无需训练的提示格式,通过组合这些触发器实现中间预算推理,在准确率-长度权衡上始终优于固定词元和基于提示的基线。此外,在监督微调后将Mid-Think应用于强化学习训练,可将训练时间减少约15%,同时将Qwen3-8B在AIME上的最终性能从69.8%提升至72.4%,在GPQA上从58.5%提升至61.1%,证明了其在推理时控制和基于强化学习的推理训练中的有效性。

英文摘要

Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading "Okay" token induces reasoning behavior, while the newline pattern following "</think>" suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.

2601.05633 2026-06-04 cs.CL

GIFT: Games as Informal Training for Generalizable LLMs

GIFT:游戏作为通用型LLM的非正式训练

Nuoyan Lyu, Bingbing Xu, Xueyun Tian, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS(人工智能安全国家重点实验室,计算技术研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) University of Washington(华盛顿大学) National University of Singapore(新加坡国立大学)

AI总结 提出将游戏作为非正式训练环境,结合协调子任务训练(CST)方法,提升LLM在抽象推理、规划、创造力等通用能力上的泛化性能。

详情
AI中文摘要

最近的LLM在数学推理和代码生成等正式任务上表现出色,但在规划、创造力和社交智能等更广泛的能力上仍然存在困难。受人类学习的启发,其中正式指导和非正式经验共同塑造智力,我们将非正式学习引入LLM训练,并使用游戏作为无注释、反馈驱动的环境。为了涵盖抽象推理、规划、创造力和社交互动等多种能力,我们将正式数学任务与三种代表性游戏任务(矩阵游戏、井字棋和谁是卧底)相结合。然而,在统一的RL目标下直接混合这些任务可能会模糊特定任务的学习信号,并且没有为协调任务梯度方向提供明确的指导。为了解决这些问题,我们提出了协调子任务训练(CST),它用顺序的子任务特定更新替换单一的混合更新,分离异质RL信号,同时隐式促进子任务间的协调。在能力导向基准上的实验表明,基于游戏的非正式学习提高了超越正式训练的泛化能力,而CST通过保持领域内子任务性能并提高更广泛的通用能力,进一步增强了多任务RL。代码和数据已公开。

英文摘要

Recent LLMs excel at formal tasks such as mathematical reasoning and code generation, but still struggle with broader abilities such as planning, creativity, and social intelligence. Inspired by human learning, where formal instruction and informal experience jointly shape intelligence, we introduce informal learning into LLM training and use games as annotation-free, feedback-driven environments. To cover diverse abilities including abstract reasoning, planning, creativity, and social interaction, we combine formal math tasks with three representative game tasks: Matrix Games, TicTacToe, and Who's the Spy. However, directly mixing these tasks under a unified RL objective can blur task-specific learning signals and provides no explicit guidance for coordinating task-gradient directions. To address these issues, we propose Coordinated Subtask Training (CST), which replaces a single mixed update with sequential subtask-specific updates, separating heterogeneous RL signals while implicitly promoting coordination among subtasks. Experiments on ability-oriented benchmarks show that game-based informal learning improves generalization beyond formal training alone, while CST further enhances multi-task RL by preserving in-domain subtask performance and improving broader general abilities. Code and data are publicly available.

2511.07107 2026-06-04 cs.AI cs.CL

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

MENTOR: 一种元认知驱动的自我进化框架,用于发现和缓解大语言模型中的隐式领域风险

Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Qingshan Liu, Guangze Ye, Guoqing Wang, Jie Zhou, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Shanghai AI Lab, Shanghai Innovation Institute(上海人工智能实验室,上海创新研究院)

AI总结 针对大语言模型在特定领域(如教育、金融、管理)中存在的隐式安全风险,提出基于元认知自我评估和动态规则知识图谱的MENTOR框架,通过激活级引导信号有效降低攻击成功率。

详情
AI中文摘要

确保大语言模型(LLMs)的安全性对于实际部署至关重要。然而,当前的安全措施往往无法解决隐式的、特定领域的风险。为了研究这一差距,我们引入了一个包含3000个标注查询的数据集,涵盖教育、金融和管理领域。对14个主流LLMs的评估揭示了一个令人担忧的漏洞:平均越狱成功率为57.8%。为此,我们提出了MENTOR,一种元认知驱动的自我进化框架。MENTOR执行元认知自我评估,采用视角转换和后果推理等策略来揭示潜在的模型错位。由此产生的反思被提炼为动态的基于规则的知识图谱,从中检索到的规则被转换为激活级引导信号,以在推理过程中指导内部表示。实验表明,MENTOR在所有测试领域显著降低了攻击成功率,并优于现有的安全对齐方法。MENTOR的代码和数据集可在 https://anonymous.4open.science/r/MENTOR-Evo 获取。

英文摘要

Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR performs metacognitive self-assessment, using strategies such as perspective-taking and consequential reasoning to uncover latent model misalignments. The resulting reflections are distilled into dynamic rule-based knowledge graphs, from which retrieved rules are converted into activation-level steering signals to guide internal representations during inference. Experiments demonstrate that MENTOR substantially reduces attack success rates across all tested domains and outperforms existing safety alignment methods. The code and dataset for MENTOR are available at: https://anonymous.4open.science/r/MENTOR-Evo.

2411.05894 2026-06-04 cs.CL cs.AI cs.LG

SSSD: Simply-Scalable Speculative Decoding

SSSD: 简单可扩展的推测解码

Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli

发表机构 * Huawei(华为) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种无需训练的推测解码方法SSSD,结合轻量级n-gram匹配和硬件感知推测,在多种基准测试中达到与领先训练方法相当的性能,延迟降低高达2.9倍,且对语言和领域变化具有鲁棒性。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main Conference)

详情
AI中文摘要

推测解码已成为加速大型语言模型推理的流行技术。然而,大多数现有方法在生产服务系统中仅带来适度的改进。实现显著加速的方法通常依赖于额外的训练草案模型或辅助模型组件,增加了部署和维护的复杂性。这种增加的复杂性降低了灵活性,特别是当服务负载转移到草案模型训练数据中未充分表示的任务、领域或语言时。我们引入了简单可扩展的推测解码(SSSD),一种无需训练的方法,结合了轻量级n-gram匹配和硬件感知推测。相对于标准自回归解码,SSSD将延迟降低高达2.9倍。它在广泛的基准测试中达到了与领先的基于训练的方法相当的性能,同时需要显著更低的采用成本——无需数据准备、训练或调优——并且在语言和领域变化以及长上下文设置中表现出优越的鲁棒性。

英文摘要

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort (no data preparation, training, or tuning is needed) and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
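As a rough illustration of training-free drafting by n-gram matching, the sketch below proposes draft tokens by looking up the most recent earlier occurrence of the current suffix in the context, then keeps the longest prefix that a target model agrees with. This is a generic lookup-style sketch, not SSSD's implementation: the function names, the `target_next` callback, and the greedy acceptance rule are assumptions, and hardware-aware speculation is not modeled.

```python
def propose_draft(tokens, n=2, max_draft=4):
    """Return the tokens that followed the latest earlier match of the last n tokens."""
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # Scan backwards, skipping the suffix itself (which starts at len(tokens) - n).
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []


def accept_prefix(draft, target_next):
    """Keep draft tokens while they equal the target model's greedy choice.

    target_next is a hypothetical callback: given the draft tokens accepted
    so far, it returns the token the target model would emit next.
    """
    accepted = []
    for tok in draft:
        if target_next(accepted) != tok:
            break
        accepted.append(tok)
    return accepted
```

In a real serving system the target model would score all draft positions in one batched forward pass; the callback here only stands in for that verification step.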

2507.16199 2026-06-04 cs.CL

LLM Abstention Can Be a Prompt Artifact, in Addition to Genuine Uncertainty

LLM 的拒绝回答可能既是真实不确定性的体现,也是提示的产物

Zipeng Ling, Shuliang Liu, Yuehao Tang, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Pennsylvania(宾夕法尼亚大学) Huazhong University of Science and Technology(华中科技大学) Nanjing University of Posts and Telecommunications(南京邮电大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文发现大语言模型(LLM)的拒绝回答行为不仅源于真实不确定性,还受提示结构影响,称为“拒绝膨胀”,并通过实验证明该现象由额外选项的结构性存在触发,而非真实不确定性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被训练来拒绝回答它们不确定的问题。然而,这种能力经常被误用:在实际应用中,输入提示有时包含不确定性元素,受此驱动,LLM 倾向于拒绝回答它们本可以解决的问题。我们认为 LLM 的拒绝回答不仅是真实不确定性的表达;它也是一种很大程度上受提示影响的产物。我们将这种现象命名为 *拒绝膨胀*。我们为 LLM 添加“未知”作为额外选项供其选择;实验表明,在真/假问题(TFQ)上准确率严重下降。将“未知”替换为不相关的随机词会产生相同的效果。我们认为 LLM 被训练成模仿 *拒绝回答* 的表面模式,而不是表达真实的不确定性。基于十个实验,我们支持四个主张,它们构成了一个递进的论证:(C1)*拒绝膨胀* 是由额外选项的结构性存在触发的,而不是由真实不确定性触发的;(C2)进一步,它使模型在能够回答时也否认自己能回答;(C3)在表示层面,这表现为后层输出覆盖;(C4)最后,这种偏差是稳定的,并通过指令调优出现,而非随机噪声。

英文摘要

Large Language Models (LLMs) are increasingly trained to abstain from answering questions they are unsure about. However, this ability is often misused: in real-world applications, input prompts sometimes contain uncertainty elements, and driven by this, LLMs are inclined to abstain even on problems they are capable of solving. We argue that LLM abstention is not only an expression of genuine uncertainty; it is also an artifact that can be largely influenced by prompts. We name this phenomenon *Abstention Inflation*. We add "Unknown" as an extra option for LLMs to choose from; experiments show serious accuracy drops on True/False Questions (TFQs). Replacing "Unknown" with an unrelated random word produces an identical effect. We argue that LLMs are trained to imitate the surface pattern of *abstention*, rather than to express genuine uncertainty. Based on ten experiments, we support four claims that form a progressive argument: **(C1)** *Abstention Inflation* is triggered by the structural presence of an extra option, not by genuine uncertainty; **(C2)** further, it makes the model deny it can answer even when it can; **(C3)** at the representation level, this manifests as a later-layer output override; **(C4)** finally, this bias is stable and emerges through instruction tuning, rather than stochastic noise.

2512.24698 2026-06-04 cs.RO

Dynamic Policy Learning for Legged Robot with Simplified Model Pretraining and Model-Homotopy-Inspired Transfer

基于简化模型预训练与模型同伦启发迁移的足式机器人动态策略学习

Dongyun Kang, Min-Gyu Kim, Tae-Gyu Song, Hajun Kim, Sehoon Ha, Hae-Won Park

发表机构 * Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST)(机械工程系,韩国科学技术院(KAIST)) School of Interactive Computing, Georgia Institute of Technology(交互计算学院,佐治亚理工学院)

AI总结 提出一种延续学习框架,结合简化模型预训练和模型同伦启发迁移,高效生成和优化足式机器人的复杂动态行为,并在真实四足机器人上验证了翻转和墙面辅助等动态任务。

Comments 8 pages

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8068-8075, July 2026

详情
AI中文摘要

为足式机器人生成动态运动仍然是一个具有挑战性的问题。虽然强化学习在各种足式运动任务中取得了显著成功,但产生高度动态的行为通常需要大量的奖励调整或高质量的演示。利用降阶模型有助于缓解这些挑战。然而,当将策略迁移到全身动力学环境时,模型差异构成了重大挑战。在这项工作中,我们引入了一个基于延续的学习框架,该框架结合了简化模型预训练和模型同伦启发迁移,以高效生成和优化复杂的动态行为。首先,我们使用单刚体模型预训练策略,以在简化环境中捕获核心运动模式。接下来,我们采用延续策略逐步将策略迁移到全身环境,以最小化性能损失。为了定义延续路径,我们引入了一条从单刚体模型到全身模型的参数化过渡路径,通过逐步重新分配躯干和腿之间的质量和惯性。与基线方法相比,所提出的方法在迁移过程中实现了更快的收敛并表现出更优的稳定性。我们的框架在包括翻转和墙面辅助机动在内的多种动态任务上得到了验证,并成功部署在真实的四足机器人上。

英文摘要

Generating dynamic motions for legged robots remains a challenging problem. While reinforcement learning has achieved notable success in various legged locomotion tasks, producing highly dynamic behaviors often requires extensive reward tuning or high-quality demonstrations. Leveraging reduced-order models can help mitigate these challenges. However, the model discrepancy poses a significant challenge when transferring policies to full-body dynamics environments. In this work, we introduce a continuation-based learning framework that combines simplified model pretraining and model-homotopy-inspired transfer to efficiently generate and refine complex dynamic behaviors. First, we pretrain the policy using a single rigid body model to capture core motion patterns in a simplified environment. Next, we employ a continuation strategy to progressively transfer the policy to the full-body environment, minimizing performance loss. To define the continuation path, we introduce a parametric transition path from the single rigid body model to the full-body model by gradually redistributing mass and inertia between the trunk and legs. The proposed method achieves faster convergence and demonstrates superior stability during the transfer process compared to baseline methods. Our framework is validated on a range of dynamic tasks, including flips and wall-assisted maneuvers, and is successfully deployed on a real quadrupedal robot.
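The continuation path described above can be pictured with a single parameter that moves mass from the trunk to the legs. The sketch below assumes a linear schedule in a hypothetical parameter `lam`, where `lam = 0` corresponds to a single rigid body (all mass in the trunk) and `lam = 1` to the full-body mass distribution. The linear form and the function name are assumptions, and inertia redistribution is omitted.

```python
def interpolate_masses(lam, trunk_mass, leg_masses):
    """Mass distribution along an assumed linear path between the two models.

    trunk_mass and leg_masses describe the full-body model. Total mass is
    conserved for every lam in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    total = trunk_mass + sum(leg_masses)
    legs = [lam * m for m in leg_masses]
    trunk = total - sum(legs)
    return trunk, legs
```

A training loop would raise `lam` step by step and continue policy optimization at each stage, so the policy never faces the full model discrepancy at once.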

2512.22105 2026-06-04 cs.CV

Learning Association via Track-Detection Matching for Multi-Object Tracking

通过轨迹-检测匹配学习关联用于多目标跟踪

Momir Adžemović

发表机构 * Algoritmi i računarstvo, University of Belgrade(贝尔格莱德大学算法与计算机科学系)

AI总结 提出TDLP方法,通过链接预测学习轨迹与检测之间的关联,在保持模块化和计算效率的同时超越现有方法。

Comments 14 pages (+4 for references), 8 tables, 4 figures

详情
AI中文摘要

多目标跟踪旨在通过跨视频帧关联检测来维持目标身份。现有文献中存在两种主要范式:基于检测的跟踪方法,计算效率高但依赖手工设计的关联启发式;以及端到端方法,从数据中学习关联但计算复杂度较高。我们提出轨迹-检测链接预测(TDLP),一种基于检测的跟踪方法,通过轨迹和检测之间的链接预测(即预测每帧中每条轨迹的正确延续)来执行逐帧关联。TDLP在架构上主要针对几何特征(如边界框)设计,同时可选地融入额外线索,包括姿态和外观。与基于启发式的方法不同,TDLP直接从数据中学习关联,无需手工规则,同时与端到端跟踪器相比保持模块化和计算效率。在多个基准上的大量实验表明,TDLP在基于检测的跟踪和端到端方法中均持续超越最先进性能。最后,我们提供了链接预测与基于度量学习的关联的详细分析,并表明链接预测更有效,特别是在处理异构特征(如检测边界框)时。我们的代码可在 https://github.com/Robotmurlock/TDLP 获取。

英文摘要

Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in the literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at https://github.com/Robotmurlock/TDLP.

2512.21094 2026-06-04 cs.CV

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

T2AV-Compass:迈向文本到音频-视频生成的统一评估

Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jiahao Wang, Jialu Chen, Miao Deng, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu

发表机构 * NJU-LINK Team, Nanjing University(南京大学NJU-LINK团队) Kling Team, Kuaishou Technology(快手技术 Kling 团队) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出T2AV-Compass基准,通过分类学驱动的500个复杂提示和双层次评估框架(客观信号指标+主观MLLM评判),系统评估文本到音频-视频生成模型,发现现有模型在跨模态对齐和指令遵循方面显著不足。

Comments 41 pages, 13 figures, 12 tables. Accepted at ICML 2026

详情
AI中文摘要

文本到音频-视频(T2AV)生成旨在从自然语言合成时间连贯的视频和语义同步的音频,但其评估仍然碎片化,通常依赖单模态指标或范围狭窄的基准,无法捕捉跨模态对齐、指令遵循和复杂提示下的感知真实性。为解决这一局限,我们提出了T2AV-Compass,一个用于全面评估T2AV系统的统一基准,包含通过分类学驱动流程构建的500个多样且复杂的提示,以确保语义丰富性和物理合理性。此外,T2AV-Compass引入了一个双层次评估框架,将用于视频质量、音频质量和跨模态对齐的客观信号级指标与用于指令遵循和真实性评估的主观MLLM-as-a-Judge协议相结合。对11个代表性T2AV系统的广泛评估表明,即使是最强的模型也远未达到人类水平的真实性和跨模态一致性,在音频真实性、细粒度同步、指令遵循等方面持续失败。这些结果表明未来模型有显著的改进空间,并凸显了T2AV-Compass作为推进文本到音频-视频生成的挑战性和诊断性测试平台的价值。

英文摘要

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following, among others. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

2512.17678 2026-06-04 cs.LG cs.AI

You Only Train Once: Differentiable Subset Selection for Omics Data

你只训练一次:用于组学数据的可微分子集选择

Daphné Chopard, Jorge da Silva Gonçalves, Irene Cannistraci, Thomas M. Sutter, Julia E. Vogt

发表机构 * Department of Computer Science, ETH Zurich(计算机科学系,苏黎世联邦理工学院) Department of Intensive Care and Neonatology, University Children’s Hospital Zurich(重症医学与新生儿科,苏黎世大学儿童医院)

AI总结 提出YOTO框架,通过端到端可微架构联合选择离散基因子集并进行预测,实现稀疏、多任务学习,提升单细胞转录组数据分析性能。

Comments Camera-ready version accepted at Transactions on Machine Learning Research (TMLR)

Journal ref Transactions on Machine Learning Research, 2026

详情
AI中文摘要

从单细胞转录组数据中选择紧凑且信息丰富的基因子集对于生物标志物发现、提高可解释性和成本效益分析至关重要。然而,大多数现有的特征选择方法要么作为多阶段流水线运行,要么依赖于事后特征归因,使得选择和预测弱耦合。在这项工作中,我们提出了YOTO(你只训练一次),一个端到端框架,在单个可微架构中联合识别离散基因子集并进行预测。在我们的模型中,预测任务直接指导选择哪些基因,而学习到的子集反过来塑造预测表示。这种闭环反馈使模型能够在训练过程中迭代地优化其选择内容和预测方式。与现有方法不同,YOTO强制执行稀疏性,使得只有选中的基因对推理有贡献,从而无需训练额外的下游分类器。通过多任务学习设计,模型在相关目标之间学习共享表示,使得部分标记的数据集能够相互提供信息,并发现无需额外训练步骤即可跨任务泛化的基因子集。我们在两个代表性的单细胞RNA-seq数据集上评估YOTO,显示它持续优于最先进的基线。这些结果表明,稀疏、端到端、多任务的基因子集选择提高了预测性能,并产生了紧凑且有意义的基因子集,推进了生物标志物发现和单细胞分析。

英文摘要

Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpretability, and cost-effective profiling. However, most existing feature selection approaches either operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. In this work, we present YOTO (you only train once), an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. In our model, the prediction task directly guides which genes are selected, while the learned subsets, in turn, shape the predictive representation. This closed feedback loop enables the model to iteratively refine both what it selects and how it predicts during training. Unlike existing approaches, YOTO enforces sparsity so that only the selected genes contribute to inference, eliminating the need to train additional downstream classifiers. Through a multi-task learning design, the model learns shared representations across related objectives, allowing partially labeled datasets to inform one another, and discovering gene subsets that generalize across tasks without additional training steps. We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines. These results demonstrate that sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis.

2512.16919 2026-06-04 cs.CV cs.AI cs.RO

DVGT: Driving Visual Geometry Transformer

DVGT: 驾驶视觉几何变换器

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu

发表机构 * Tsinghua University(清华大学) University of Macau(澳门大学) Xiaomi EV(小米电动车) Peking University(北京大学)

AI总结 提出DVGT,一种从无位姿多视角图像序列重建全局稠密3D点图的视觉几何变换器,通过交替注意力机制学习几何关系,无需相机参数和后处理对齐,在多个驾驶数据集上显著优于现有模型。

Comments Code is available at https://github.com/wzzheng/DVGT

详情
AI中文摘要

从视觉输入中感知和重建3D场景几何对于自动驾驶至关重要。然而,目前仍缺乏一种能够适应不同场景和相机配置的、面向驾驶的稠密几何感知模型。为弥补这一空白,我们提出了驾驶视觉几何变换器(DVGT),它从一系列无位姿的多视角视觉输入中重建全局稠密3D点图。我们首先使用DINO骨干网络为每张图像提取视觉特征,并采用交替的视角内局部注意力、跨视角空间注意力和跨帧时间注意力来推断图像间的几何关系。然后,我们使用多个头解码第一帧自车坐标系下的全局点图以及每帧的自车位姿。与依赖精确相机参数的传统方法不同,DVGT无需显式的3D几何先验,能够灵活处理任意相机配置。DVGT直接从图像序列预测度量尺度的几何,消除了与外部传感器后对齐的需求。在包含nuScenes、OpenScene、Waymo、KITTI和DDAD的大型驾驶数据集混合训练下,DVGT在各种场景中显著优于现有模型。代码可在https://github.com/wzzheng/DVGT获取。

英文摘要

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate frame of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2512.15552 2026-06-04 cs.CL

Automated Lexical Coverage for Language Learning: From General to Specialized Word Lists

语言学习的自动化词汇覆盖:从通用到专业词表

Dakota Ellis, Samy Babikerali, Wanshan Chen, Bao Dinh, Uyen Le

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) School of Data Science(数据科学学院)

AI总结 本文提出一种基于目标文本自动生成专业词表的方法,相比通用词表能以更少词汇达到95%的文本覆盖率,并实现自动化、可扩展的词汇学习资源构建。

详情
AI中文摘要

通用服务词表(GSL)是语言学习者识别重要英语单词的常用资源。传统的GSL创建依赖语言专业知识和主观输入,资源消耗大。我们创建了自己的GSL,并评估其与新通用服务词表(NGSL)的性能。我们发现,针对特定文本定制的专业词表(SWL)是语言学习者的实用方法。由于SWL源自目标文本本身,它通过构造达到语言理解所需的95%覆盖率,并且与应用于同一文本的通用词表相比,使用的词汇量显著更少:在涵盖小说、学术论文和脚本的九个文本中,NGSL覆盖了每个文本的64-85%,而文本特定词表以更小的词汇量达到95%。通过仅依赖客观标准,SWL过程可以自动化、可扩展,并针对全球语言学习者的需求进行定制。

英文摘要

A General Service List (GSL) is a commonly used resource for language learners to identify important English words. Traditional GSL creation is resource-intensive, relying on linguistic expertise and subjective input. We created our own GSL and evaluated its performance against the New General Service List (NGSL). We found that creating a Specialized Word List (SWL), tailored to a specific text, is a practical method for language learners. Because an SWL is derived from the target text itself, it reaches the 95% coverage required for language comprehension by construction, and it does so with substantially fewer words than a general list applied to the same text: across nine texts spanning fiction, academic papers, and scripts, the NGSL covered 64-85% of each text, whereas a text-specific list reached 95% with far smaller vocabularies. Because the SWL process relies only on objective criteria, it can be automated, scaled, and tailored to the needs of language learners across the globe.
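The coverage comparison above can be sketched in a few lines. The code below measures what fraction of a text's tokens a word list covers, and builds a text-specific list by adding the most frequent words until a target coverage is reached. The tokenizer, the function names, and the frequency-greedy selection are assumptions for illustration; the paper's pipeline (for example, its handling of lemmas or word families) may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(word_list, tokens):
    """Fraction of running tokens that appear in word_list."""
    known = set(word_list)
    return sum(t in known for t in tokens) / len(tokens)


def specialized_word_list(tokens, target=0.95):
    """Add words in descending frequency until the target coverage is met."""
    counts = Counter(tokens)
    total = len(tokens)
    chosen, covered = [], 0
    for word, count in counts.most_common():
        if covered / total >= target:
            break
        chosen.append(word)
        covered += count
    return chosen
```

Because the list is drawn from the text's own frequency table, it meets the target coverage by construction, which is the property the abstract contrasts with a general list such as the NGSL.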

2512.05277 2026-06-04 cs.CV cs.AI

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

从片段到场景:自动驾驶中基于视觉语言模型的时间理解

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学) ETH Zurich(苏黎世联邦理工学院) University of Washington(华盛顿大学) University of Southern California(南加州大学)

AI总结 提出自动驾驶时间理解基准TAD,通过场景思维链和轨迹认知图两种无训练方法提升视觉语言模型的时间推理能力。

详情
AI中文摘要

视觉语言模型(VLM)越来越多地被部署为野外自主代理的感知和推理骨干,其中自动驾驶(AD)是最安全关键的实例之一。可靠的时间理解对于此类代理预测事件、归因原因和在动态环境中安全行动至关重要,但即使对于最先进的(SoTA)VLM来说,这仍然是一个重大挑战。先前的视频基准强调了其他内容(体育、烹饪等),但现有基准没有专门关注短时和长时AD视频的时间理解。为填补这一空白,我们提出了自动驾驶时间理解(TAD)基准,包含近6000个问答(QA)对,涵盖7个任务,并评估了9个闭源和开源通用以及AD专用模型。当前SoTA模型在TAD上的表现远低于人类准确率。为了改进基于VLM的驾驶代理的时间推理,我们提出了两种新颖的无训练解决方案:Scene-CoT,它使用思维链(CoT)推理;以及TCogMap,它结合了由轨迹分析模块生成的自我中心时间认知图,该模块作为VLM周围的代理工具运行。与现有VLM集成后,我们的方法在TAD上的平均准确率提高了高达17.72%,在STSBench上提高了高达10.35%。通过引入TAD、对SoTA模型进行基准测试并提出有效的增强方法,本工作旨在促进野外代理AD系统时间理解的进一步进展。基准和评估代码分别可在 Hugging Face(https://huggingface.co/datasets/vbdai/TAD)和 GitHub(https://github.com/vbdi/tad_bench)上获取。

英文摘要

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to $17.72\%$ and by up to $10.35\%$ on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2512.08331 2026-06-04 cs.CV

DMAConv: Dual Mask-Adaptive Convolution for Remote Sensing Pansharpening

DMAConv: 用于遥感全色锐化的双掩膜自适应卷积

Xianghong Xiao, Zeyu Xia, Zhou Fei, Jinliang Xiao, Haorui Chen, Liangjian Deng

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Tongji University(同济大学)

AI总结 提出双掩膜自适应卷积(DMAConv),通过软硬掩膜动态分配计算资源,以轻量级双分支结构高效处理遥感图像的区域异质性,实现SOTA性能且计算成本最低。

详情
AI中文摘要

全色锐化旨在融合高分辨率全色图像与低分辨率多光谱图像。现有的深度学习方法,包括最近的自适应卷积,难以应对遥感图像的区域异质性,且往往计算成本过高。为解决这些挑战,我们提出双掩膜自适应卷积(DMAConv),这是一种根据特征特性动态分配计算资源的新型算子。DMAConv首先使用轻量级模块生成软掩膜和硬掩膜。硬掩膜将特征分为一个紧凑分支(用于全局处理冗余信息)和一个聚焦分支(以更多计算投入建模复杂异质区域)。随后,软掩膜对两个分支的输入特征进行初步调制。这种双分支掩膜自适应设计显著增强了特征表示,同时最小化了计算开销。大量实验表明,我们的方法在广泛的定量基准上达到了SOTA,且参数数量显著更低,计算成本在自适应卷积模型中最低。

English abstract

Pansharpening aims to fuse a high-resolution panchromatic image with a low-resolution multispectral image. Existing deep learning methods, including recent adaptive convolutions, struggle with regional heterogeneity in remote sensing images and often incur prohibitive computational costs. To address these challenges, we propose Dual Mask-Adaptive Convolution (DMAConv), a novel operator that dynamically allocates computational resources based on feature characteristics. DMAConv first employs a lightweight module to generate soft and hard masks. The hard mask separates features into a compact branch for processing redundant information globally and a focused branch that models complex, heterogeneous regions with greater computational investment. The soft mask then preliminarily modulates the input features for both branches. This dual-branch, mask-adaptive design significantly enhances feature representation while minimizing computational overhead. Extensive experiments demonstrate that our method achieves SOTA on a broad array of quantitative benchmarks, with substantially lower parameter counts and the minimal computational cost among adaptive convolution models.
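
The routing idea described in the abstract (a hard mask choosing between a cheap global branch and a heavier local branch, with a soft mask modulating the input to both) can be illustrated with a toy sketch. This is not the DMAConv operator: the function name `dual_mask_route`, the deviation-based scoring rule, the `threshold` parameter, and both branch computations are hypothetical stand-ins chosen only to make the control flow concrete on a flat, non-empty list of feature values.

```python
import math


def dual_mask_route(features, threshold=0.5):
    """Toy soft/hard mask routing over a non-empty list of floats.

    Hypothetical illustration only: the scoring rule, the threshold, and
    both branch operations are placeholders, not the paper's method.
    """
    n = len(features)
    mean = sum(features) / n

    # Stand-in for the "lightweight module": score each position by how far
    # it deviates from the global mean. Values lie in [0, 1).
    soft = [1.0 - math.exp(-abs(x - mean)) for x in features]

    # Hard mask: True sends a position to the focused branch.
    hard = [s > threshold for s in soft]

    # The soft mask modulates the input seen by both branches.
    modulated = [x * s for x, s in zip(features, soft)]

    # Compact branch: a single shared (global) value for all positions
    # that the hard mask marks as redundant.
    compact_vals = [m for m, h in zip(modulated, hard) if not h]
    compact_out = sum(compact_vals) / len(compact_vals) if compact_vals else 0.0

    out = []
    for i, (m, h) in enumerate(zip(modulated, hard)):
        if h:
            # Focused branch: a 3-tap local average with clamped edges
            # stands in for the more expensive per-position computation.
            left = modulated[max(i - 1, 0)]
            right = modulated[min(i + 1, n - 1)]
            out.append((left + m + right) / 3.0)
        else:
            out.append(compact_out)
    return out, hard
```

For example, `dual_mask_route([1.0, 1.0, 1.0, 5.0], threshold=0.8)` routes only the last position to the focused branch, while the three flat positions share one compact-branch value.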

2512.08094 2026-06-04 cs.CL

Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

分割、嵌入和对齐:将字幕与手语对齐的通用方法

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

Affiliations: VGG, Dept. of Engineering Science, University of Oxford; University of Zurich; KAIST; LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris

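AI summary: Proposes SEA, a general framework that uses pretrained models to segment a video frame sequence into individual signs and to embed each sign clip into a latent space shared with text, then aligns them efficiently with a lightweight dynamic programming procedure, reaching state-of-the-art performance on four sign language datasets.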

Comments Camera-ready version of ACL 2026 (Main)

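AI abstract (translated from Chinese)

The goal of this work is to develop a general method for aligning subtitles (i.e., spoken-language text with corresponding timestamps) to continuous sign language video. Previous methods usually rely on end-to-end training for a specific language or dataset, which limits their generality. In contrast, our method, Segment, Embed, and Align (SEA), provides a single framework that applies across multiple languages and domains. SEA uses two pretrained models: the first segments the video frame sequence into individual signs, and the second embeds the video clip of each sign into a latent space shared with text. Alignment is then performed by a lightweight dynamic programming procedure that runs efficiently on CPU, taking less than a minute even for hour-long videos. SEA is flexible and adapts to a variety of scenarios, using resources ranging from small lexicons to large continuous corpora. Experiments on four sign language datasets show state-of-the-art alignment performance, highlighting SEA's potential to generate high-quality parallel data to advance sign language processing. SEA's code and models are publicly available.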

English abstract

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
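
The abstract does not specify the dynamic programming formulation, so the following is only a generic sketch of a monotonic alignment step, assuming a precomputed similarity matrix between sign segments and subtitles. The function name `align_monotonic`, the `sim` layout, and the objective (maximize total similarity while subtitle indices never decrease over time) are assumptions for illustration, not SEA's actual procedure.

```python
def align_monotonic(sim):
    """Assign each sign segment to one subtitle, in non-decreasing order.

    sim[j][i] is a hypothetical similarity score between sign segment j
    and subtitle i (e.g. from embeddings in a shared space). Returns a
    list whose j-th entry is the subtitle index chosen for segment j.
    """
    n_seg = len(sim)
    n_sub = len(sim[0])

    # best[j][i]: best total score for segments 0..j, with segment j
    # assigned to subtitle i. back[j][i]: subtitle used by segment j - 1.
    best = [[0.0] * n_sub for _ in range(n_seg)]
    back = [[0] * n_sub for _ in range(n_seg)]
    best[0] = list(sim[0])

    for j in range(1, n_seg):
        run_best, run_arg = float("-inf"), 0
        for i in range(n_sub):
            # A running maximum over previous subtitle indices <= i
            # enforces the monotonic (time-ordered) constraint.
            if best[j - 1][i] > run_best:
                run_best, run_arg = best[j - 1][i], i
            best[j][i] = sim[j][i] + run_best
            back[j][i] = run_arg

    # Backtrack from the best final state.
    i = max(range(n_sub), key=lambda k: best[-1][k])
    path = [i]
    for j in range(n_seg - 1, 0, -1):
        i = back[j][i]
        path.append(i)
    path.reverse()
    return path
```

The table has one cell per (segment, subtitle) pair and each cell is filled in constant time, so the cost grows with the product of the two counts. A per-segment greedy argmax would be cheaper still, but it can produce assignments that jump backwards in time, which the running-maximum constraint rules out.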