arXivDaily: arXiv daily academic digest, updated Monday to Friday
2509.07963 2026-06-04 cs.LG

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson

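Affiliations University of Cambridge

AI summary To address information loss from low-dimensional projections and the lack of a distance-dependent bias in standard attention, the paper proposes high-rank scoring functions based on Block Tensor-Train (BTT) and contiguous Multi-Level Low Rank (MLR) structured matrices, improving performance on in-context regression, language modeling, and long-range time-series forecasting.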

Comments ICML 2025. Code available at https://github.com/YilunKuang/structured-attention

Abstract

The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and contiguous Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. Finally, we show that MLR attention has promising results for long-range time-series forecasting.
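
To make the information-loss point concrete, here is a minimal Python sketch under toy assumptions: 3-dimensional inputs and rank-1 query/key projections. The matrices `W_Q` and `W_K` and the helper names are illustrative and are not taken from the paper. Two keys that differ only in directions the projection discards receive exactly the same score.

```python
def project(W, x):
    """Return W^T x for a d x r matrix W stored as a list of d rows."""
    r = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(r)]


def score(x_i, x_j, W_Q, W_K):
    """Standard bilinear attention score: (W_Q^T x_i) . (W_K^T x_j)."""
    q = project(W_Q, x_i)
    k = project(W_K, x_j)
    return sum(a * b for a, b in zip(q, k))


# Hypothetical rank-1 projections that keep only the first coordinate.
W_Q = [[1.0], [0.0], [0.0]]
W_K = [[1.0], [0.0], [0.0]]

query = [2.0, 1.0, -1.0]
key_a = [3.0, 0.0, 0.0]
key_b = [3.0, 5.0, -7.0]  # differs from key_a only in discarded coordinates

s_a = score(query, key_a, W_Q, W_K)
s_b = score(query, key_b, W_Q, W_K)
print(s_a, s_b)  # both scores are identical
```

A higher-rank scoring matrix could tell `key_a` and `key_b` apart. The structured matrices in the abstract aim to provide that rank at low compute cost, although this sketch does not implement BTT or MLR themselves.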

2509.03351 2026-06-04 cs.LG cs.AI q-bio.QM

epiGPTope: A machine learning-based epitope generator and classifier

Natalia Flechas Manrique, Alberto Martínez, Elena López-Martínez, Luc Andrea, Román Orus, Aitor Manteca, Aitziber L. Cortajarena, Llorenç Espinosa-Portalés

Affiliations Multiverse Computing; Centre for Cooperative Research in Biomaterials (CIC biomaGUNE); Basque Research and Technology Alliance (BRTA); Donostia International Physics Center; Ikerbasque Foundation for Science

AI summary Proposes epiGPTope, a large language model that is pre-trained and then fine-tuned to directly generate novel epitope sequences, combined with statistical classifiers that predict whether an epitope is of bacterial or viral origin, to speed up the construction and screening of synthetic epitope libraries.

Comments 11 pages, 4 figures. Supplementary Information with 5 pages, 4 figures

Journal ref ACS Synthetic Biology 2026 15 (2), 631-642

Abstract

Epitopes are short antigenic peptide sequences that are recognized by antibodies or immune cell receptors. They are central to the development of immunotherapies, vaccines, and diagnostics. However, the rational design of synthetic epitope libraries is challenging because of the large combinatorial sequence space ($20^n$ combinations for linear epitopes of n amino acids), which makes screening and testing infeasible even with high-throughput experimental techniques. In this study, we present a large language model, epiGPTope, pre-trained on protein data and specifically fine-tuned on linear epitopes, which for the first time can directly generate novel epitope-like sequences that are found to possess statistical properties analogous to those of known epitopes. This generative approach can be used to prepare libraries of epitope candidate sequences. We further train statistical classifiers to predict whether an epitope sequence is of bacterial or viral origin, thus narrowing the candidate library and increasing the likelihood of identifying specific epitopes. We propose that such a combination of generative and predictive models can be of assistance in epitope discovery. The approach uses only primary amino acid sequences of linear epitopes, bypassing the need for a geometric framework or hand-crafted features of the sequences. By developing a method to create biologically feasible sequences, we anticipate faster and more cost-effective generation and screening of synthetic epitopes, with relevant applications in the development of new biotechnologies.
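
The size of the $20^n$ sequence space is easy to check directly. The short Python sketch below uses an illustrative helper name, `num_linear_epitopes`, and assumes only the 20 standard amino acids. It shows how quickly exhaustive screening becomes impractical as the peptide length grows.

```python
AMINO_ACIDS = 20  # number of standard amino acids


def num_linear_epitopes(n: int) -> int:
    """Count the distinct linear peptide sequences of length n."""
    return AMINO_ACIDS ** n


for n in (5, 10, 15):
    print(f"n = {n:2d}: {num_linear_epitopes(n):.3e} possible sequences")
```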

2508.01815 2026-06-04 cs.CL cs.AI

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs

Chengxiao Dai, Yue Xiu, Dusit Niyato

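Affiliations University of Bristol

AI summary Proposes the SchemaForge framework, which uses question-conditioned schema-slice alignment and counterfactual validation to improve the execution accuracy of text-to-SPARQL query generation over heterogeneous knowledge graphs.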

Abstract

Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix the target graph in advance, practical knowledge graph question answering (KGQA) may involve heterogeneous graph collections with different schemas, partial alignments, and incomplete metadata. In this setting, query generation depends on more than SPARQL syntax: the system must identify a graph schema that can support the predicates, entity types, joins, filters, and constraints required by the question. We present SchemaForge, a schema-grounded agentic framework for text-to-SPARQL over heterogeneous KG collections. Its central mechanism is question-conditioned schema-slice alignment: weak graph evidence first identifies plausible graphs, while stronger schema evidence determines whether a local schema slice can realize the intended query. The selected schema slice then constrains query generation and verification before execution. When only one graph is available, the same formulation reduces to standard single-KG text-to-SPARQL with schema grounding. We evaluate SchemaForge on LC-QuAD 2.0, QALD-9 Plus, QALD-10, and Spider4SPARQL. Across the four public benchmarks, SchemaForge improves execution accuracy over the strongest matched agent baseline by 11.50 percentage points on average. On Spider4SPARQL, SchemaForge improves execution accuracy from 54.86% to 64.18% and achieves 73.0% Top-1 and 97.0% Top-3 graph allocation accuracy. These results show that moving from weak graph evidence to schema-specific query commitments, together with counterfactual answer-set checks, improves executable query generation over heterogeneous knowledge graphs.

2507.21892 2026-06-04 cs.CL

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan

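Affiliations University of Science and Technology of China

AI summary Proposes Graph-R1, the first agentic GraphRAG framework trained via end-to-end reinforcement learning. It uses lightweight knowledge hypergraph construction, multi-turn agent-environment retrieval, and an end-to-end reward mechanism, and outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.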

Comments Accepted by ICML 2026 main conference

Journal ref ICML 2026

Abstract

Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, the first agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality. Our software and data are publicly available at https://github.com/LHRLAB/Graph-R1.

2507.21638 2026-06-04 cs.AI cs.LG cs.MA cs.RO

Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

Leonard Hinckeldey, Elliot Fosong, Rimvydas Rubavicius, Elle Miller, Trevor McInroe, Fan Zhang, Patricia Wollstadt, Stefano V. Albrecht, Subramanian Ramamoorthy

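Affiliations University of California, Berkeley; Stanford University

AI summary Proposes the Assistax benchmark, which uses JAX hardware acceleration and multi-agent reinforcement learning for assistive robotics tasks, achieves speed-ups of up to 370x, and tests robots' zero-shot coordination capabilities.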

Comments Accepted at the Reinforcement Learning Conference 2026

Abstract

The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX's hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to $370\times$ faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent's zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive-autonomy/assistax.

2507.03373 2026-06-04 cs.CL

WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

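Affiliations King’s College London; Wikimedia Foundation

AI summary Proposes WETBench, a multilingual, multi-generator, task-specific benchmark for detecting machine-generated text in Wikipedia editing scenarios; experiments show that training-based detectors average 78% accuracy and zero-shot detectors average 58%.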

Abstract

Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

2506.05233 2026-06-04 cs.LG cs.AI cs.CL

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Sarthak Mittal, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento

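Affiliations Google, Paradigms of Intelligence Team; Google DeepMind; MIT CSAIL

AI summary Proposes a Mesa layer that performs locally optimal test-time training with a conjugate gradient solver; while keeping inference cost constant, it surpasses existing RNN models in language modeling perplexity and downstream benchmark performance.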

Comments Published at ICLR 2026

Abstract

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), which could only run sequentially in time and was therefore not scalable. This layer again stems from an in-context loss, but the loss is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments up to the billion-parameter scale, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
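
As a rough illustration of minimizing an in-context regression loss with conjugate gradient, the Python sketch below solves a small ridge-regression problem, $(K^\top K + \lambda I)\,w = K^\top v$, without forming the matrix explicitly. This is a simplified stand-in under stated assumptions: scalar targets, a single solve, no chunking and no recurrence. It is not the Mesa layer itself, and the function names and the toy data are hypothetical.

```python
def conjugate_gradient(matvec, b, iters=50, tol=1e-12):
    """Solve A x = b for symmetric positive-definite A given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0 initially
    p = list(r)  # first search direction
    rs = sum(v * v for v in r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def ridge_in_context(keys, values, lam=1e-6):
    """Minimize sum_i (v_i - w . k_i)^2 + lam * |w|^2 over w using CG."""
    d = len(keys[0])

    def matvec(w):
        # Apply (K^T K + lam I) to w via two passes over the context.
        proj = [sum(k[j] * w[j] for j in range(d)) for k in keys]
        return [sum(k[j] * s for k, s in zip(keys, proj)) + lam * w[j]
                for j in range(d)]

    b = [sum(k[j] * v for k, v in zip(keys, values)) for j in range(d)]
    return conjugate_gradient(matvec, b)


# Toy context generated by the (assumed) linear map w* = [2, -1].
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, -1.0, 1.0]
w = ridge_in_context(keys, values)
print(w)  # close to [2.0, -1.0]
```

For a well-conditioned system, CG needs at most as many iterations as there are dimensions in exact arithmetic, so the 2-dimensional toy problem here converges in two steps.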

2506.04281 2026-06-04 cs.LG

Uncovering Insights of Compound Flooding with Data-Driven AI

Xu Zheng, Chaohao Lin, Sipeng Chen, Zhuomin Chen, Jimeng Shi, Jayantha Obeysekera, Jingchao Ni, Wei Cheng, Jason Liu, Dongsheng Luo

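Affiliations Florida International University; Florida State University; UIUC; University of Houston; NEC Lab America; Singapore Management University

AI summary Integrating tides, rainfall, groundwater levels, and human water management activities, the study uses data-driven methods to analyze compound flooding in South Florida and finds that groundwater level is the dominant predictor and that spatially coupled states matter more than long-term temporal dependencies.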

Comments Accepted to SIGKDD 2026 AI for Science Track; 12 Pages, 5 Figures, 6 Tables

Abstract

Compound flooding, driven by nonlinear interactions between multiple hydrometeorological factors, poses a significant challenge to hazard prevention. Existing forecasting approaches, whether physics-based or data-driven, often emphasize temporal patterns while underexploring how multiple interacting factors jointly shape flood dynamics. To address this problem, we conduct a large-scale data-driven analysis of compound flooding in South Florida, a typical area for compound flooding, by integrating tidal conditions, rainfall, groundwater stage, and human water management activities. Our analysis reveals three key findings: (i) models that capture temporal dynamics alone fail to represent multi-factor interactions during compound events; (ii) subsurface saturation, as reflected by groundwater levels, emerges as a dominant predictor of flood severity, often outweighing immediate rainfall intensity in this porous coastal region; and (iii) the spatial state of surrounding monitoring stations within a finite effective radius provides critical causal context for flooding, while extending temporal history yields diminishing returns during extreme events. These findings suggest that compound flooding is governed more by spatially coupled system states than by long-term temporal dependencies, challenging rain-centric and sequence-dominated forecasting paradigms. By framing data-driven models as tools for scientific inquiry rather than prediction alone, this study offers new insights into the mechanisms of compound flooding and informs the design of more physically grounded early-warning systems for coastal environments. Our dataset and code are publicly available at https://github.com/AslanDing/SFBench.

2505.19293 2026-06-04 cs.CL cs.AI cs.LG

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

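Affiliations Case Western Reserve University; Texas A&M University; Rice University; University of California, Los Angeles; Meta

AI summary Existing long-context benchmarks cannot separate baseline ability from true long-context ability and use fixed input lengths; the paper proposes a length-controllable long-context benchmark and a new metric to evaluate the long-context ability of large language models effectively.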

Abstract

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

2505.17315 2026-06-04 cs.AI cs.CL cs.LG

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

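Affiliations Case Western Reserve University; University of Minnesota - Twin Cities; Texas A&M University

AI summary Experiments show that strengthening a model's long-context ability before supervised fine-tuning significantly improves reasoning performance, with gains that generalize even to short-input tasks, indicating that long-context modeling is a key foundation for reasoning.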

Abstract

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

2502.17956 2026-06-04 cs.CL

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

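Affiliations School of Information Science and Technology, VISTEC; KAIST; Cohere; SCB 10X; AI Singapore; Department of Computer Engineering, Chulalongkorn University

AI summary Proposes a framework that evaluates Program-of-Thought prompting by separating reasoning from code execution; finds that fine-tuning significantly improves multilingual reasoning and that reasoning quality correlates strongly with answer accuracy.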

Journal ref Findings of the Association for Computational Linguistics: ACL 2025

Abstract

Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.

2505.15354 2026-06-04 cs.LG stat.ML

Post-Training Corrections for Improved Time-Series Forecasting

Listed Chinese title: Human-in-the-Loop Adaptive Optimization for Improved Time-Series Forecasting

Hamza Cherkaoui, Malik Tiomoko, Giuseppe Paolo, Zhang Yili, Yu Meng, Zhang Keli, Hafiz Tiomoko Ali

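Affiliations SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris; Noah's Ark Lab; Independent Researcher

AI summary Proposes a lightweight post-training adaptive optimization framework that needs no retraining or architecture changes. It uses reinforcement learning, contextual bandits, or genetic algorithms to automatically learn expressive transformations that correct model outputs, lets human experts guide corrections in natural language, and consistently improves forecasting accuracy on multiple benchmarks with minimal computational overhead.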

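AI abstract (translated from Chinese)

Time-series forecasting models often make systematic, predictable errors, even in critical domains such as energy, finance, and healthcare. We introduce a novel post-training adaptive optimization framework that improves forecasting accuracy without retraining or architectural changes. Our method automatically applies expressive transformations, optimized via reinforcement learning, contextual bandits, or genetic algorithms, to correct model outputs in a lightweight and model-agnostic way. Theoretically, we prove that affine corrections always reduce the mean squared error; in practice, we extend this idea with dynamic action-based optimization. The framework also supports an optional human-in-the-loop component: domain experts can guide corrections using natural language, which a language model parses into actions. Across multiple benchmarks (e.g., electricity, weather, traffic), we observe consistent accuracy gains with minimal computational overhead. Our interactive demo shows the framework's real-time usability. By combining automatic post-hoc refinement with interpretable and extensible mechanisms, our approach offers a powerful new direction for practical forecasting systems.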

Abstract

Time-series forecasting is a critical task in various business domains, but it remains inherently challenging. Typically, large forecasting models are trained in a single, resource-intensive run. Once training is completed, a natural question arises: is there still potential for meaningful improvement in the model's performance? Motivated by techniques from boosting, we introduce the concept of post-training corrections. This approach enhances a trained forecaster by sequentially applying a carefully selected set of corrections to its predictions. Our method offers a lightweight, model-agnostic, and scalable strategy to improve forecasting performance in practical settings. We provide theoretical foundations for the approach, starting with the affine correction case, and analyze the expected performance gains and computational costs in more general settings. Across a range of benchmark datasets, our method consistently delivers up to a 30% improvement in forecasting accuracy over existing state-of-the-art models, with minimal computational overhead.
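
The affine correction case can be sketched in a few lines of Python. This is a minimal illustration, assuming the correction `a * prediction + b` is fitted by ordinary least squares on a held-out calibration set. The helper names and the toy numbers are hypothetical, and the paper's actual procedure may differ. On the data used for fitting, the corrected mean squared error cannot exceed the original one, because the identity map (a = 1, b = 0) is itself an affine correction.

```python
def fit_affine(preds, targets):
    """Least-squares fit of targets ~ a * preds + b; returns (a, b)."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in preds)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    a = cov / var_p if var_p > 0 else 1.0  # constant predictions: shift only
    b = mean_t - a * mean_p
    return a, b


def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


# Toy calibration set: the forecaster systematically shrinks toward 2.
targets = [1.0, 2.0, 3.0, 4.0]
preds = [1.5, 2.0, 2.5, 3.0]

a, b = fit_affine(preds, targets)
corrected = [a * p + b for p in preds]
mse_before = mse(preds, targets)
mse_after = mse(corrected, targets)
print(a, b, mse_before, mse_after)
```

Whether the gain carries over to unseen data depends on how stable the systematic error is, which this sketch does not test.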

2504.15587 2026-06-04 cs.LG cs.AI

MetaMolGen: A Neural Graph Motif Generation Model for De Novo Molecular Design

Zimo Yan, Jie Zhang, Zheng Xie, Chang Liu, Yizhen Liu, Yiping Song

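Affiliations National University of Defense Technology

AI summary Proposes MetaMolGen, a meta-learning-based molecular generation model that standardizes the distribution of graph motifs and uses a lightweight autoregressive sequence model to enable few-shot and property-conditioned molecular generation.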

Abstract

Molecular generation plays an important role in drug discovery and materials science, especially in data-scarce scenarios where traditional generative models often struggle to achieve satisfactory conditional generalization. To address this challenge, we propose MetaMolGen, a first-order meta-learning-based molecular generator designed for few-shot and property-conditioned molecular generation. MetaMolGen standardizes the distribution of graph motifs by mapping them to a normalized latent space, and employs a lightweight autoregressive sequence model to generate SMILES sequences that faithfully reflect the underlying molecular structure. In addition, it supports conditional generation of molecules with target properties through a learnable property projector integrated into the generative process. Experimental results demonstrate that MetaMolGen consistently generates valid and diverse SMILES sequences under low-data regimes, outperforming conventional baselines. This highlights its advantage in fast adaptation and efficient conditional generation for practical molecular design.

2504.12329 2026-06-04 cs.CL cs.AI

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han

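Affiliations Case Western Reserve University; Carnegie Mellon University

AI summary Proposes Speculative Thinking, a training-free framework in which a large reasoning model guides a smaller model at the reasoning level, improving the smaller model's reasoning accuracy while shortening its output.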

Abstract

Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, a substantial improvement of 6.2 percentage points. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, an absolute improvement of 7.8 percentage points.
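
The figures in this abstract mix two kinds of change. The gains of 6.2 and 7.8 are differences between accuracies, that is, percentage points, while 15.7% is a relative reduction in output length. The short Python check below recomputes both kinds from the numbers quoted in the abstract; the helper names are illustrative.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points between two percentages."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100.0


# 1.5B model on MATH500, with and without 32B guidance.
print(round(absolute_gain(83.2, 89.4), 1))    # percentage points
print(round(relative_change(83.2, 89.4), 1))  # relative, in percent

# Average output length in tokens.
print(round(relative_change(5439, 4583), 1))

# Qwen-2.5-7B-Instruct on the same benchmark.
print(round(absolute_gain(74.0, 81.8), 1))
print(round(relative_change(74.0, 81.8), 1))
```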

2405.08036 2026-06-04 cs.LG cs.AI

Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning

合作多智能体强化学习中潜在最优联合动作识别

Chang Huang, Shatong Zhu, Junqiao Zhao, Hongtu Zhou, Di Zhang, Hai Zhang, Chen Ye, Ziqiao Wang, Guang Chen

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Stanford University(斯坦福大学) MOE Key Lab of Embedded System and Service Computing, Tongji University, Shanghai, China(同济大学嵌入式系统与服务计算教育部重点实验室,上海,中国) The University of Hong Kong(香港大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 针对值函数分解中单调性约束限制表达能力的问题,提出潜在最优联合动作加权方法,通过迭代加权训练保证最优策略恢复,在多个任务上超越现有方法。

Comments ICLR 2026

Journal ref ICLR 2026

详情
AI中文摘要

值函数分解在合作多智能体强化学习(MARL)中被广泛使用。现有方法通常对联合动作值与个体动作值之间施加单调性约束以实现分散执行。然而,此类约束限制了值函数分解的表达能力,缩小了可表示的联合动作值范围,并阻碍了最优策略的学习。为解决这一问题,我们提出了潜在最优联合动作加权(POW)方法,该方法在现有近似加权策略可能失效的情况下确保最优策略恢复。POW通过一个理论上有依据的迭代加权训练过程,迭代地识别潜在最优联合动作并为其分配更高的训练权重。我们证明该机制保证了真实最优策略的恢复,克服了先前启发式加权策略的局限性。POW是架构无关的,可以无缝集成到现有的值函数分解算法中。在矩阵博弈、难度增强的捕食者-猎物任务、SMAC、SMACv2以及高速公路环境交叉口场景上的大量实验表明,POW显著提升了稳定性,并持续超越了最先进的基于值的MARL方法。

英文摘要

Value function factorization is widely used in cooperative multi-agent reinforcement learning (MARL). Existing approaches often impose monotonicity constraints between the joint action value and individual action values to enable decentralized execution. However, such constraints limit the expressiveness of value factorization, restricting the range of joint action values that can be represented and hindering the learning of optimal policies. To address this, we propose Potentially Optimal Joint Actions Weighting (POW), a method that ensures optimal policy recovery where existing approximate weighting strategies may fail. POW iteratively identifies potentially optimal joint actions and assigns them higher training weights through a theoretically grounded iterative weighted training process. We prove that this mechanism guarantees recovery of the true optimal policy, overcoming the limitations of prior heuristic weighting strategies. POW is architecture-agnostic and can be seamlessly integrated into existing value factorization algorithms. Extensive experiments on matrix games, difficulty-enhanced predator-prey tasks, SMAC, SMACv2, and a highway-env intersection scenario show that POW substantially improves stability and consistently surpasses state-of-the-art value-based MARL methods.

2307.00862 2026-06-04 cs.CV cs.CL

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

UniFine: 一种统一且细粒度的零样本视觉-语言理解方法

Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University(哥伦比亚大学) Microsoft Research(微软研究院) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出UniFine框架,通过利用句子关键词和图像对象等细粒度信息进行图像-文本匹配,在零样本设置下统一处理VQA、SNLI-VE和VCR等视觉-语言任务,并在多个数据集上取得显著改进。

Comments 14 pages, 4 figures, ACL 2023 Findings

详情
AI中文摘要

视觉-语言任务,如VQA、SNLI-VE和VCR,具有挑战性,因为它们需要模型的推理能力来理解视觉世界和自然语言的语义。针对视觉-语言任务的监督方法已被充分研究。然而,在零样本设置下解决这些任务的研究较少。由于对比语言-图像预训练(CLIP)在图像-文本匹配上展现了显著的零样本性能,先前的工作通过将视觉-语言任务转换为图像-文本匹配问题来利用其强大的零样本能力,并且它们主要考虑全局级别的匹配(例如,整个图像或句子)。然而,我们发现视觉和文本的细粒度信息,例如句子中的关键词和图像中的对象,对于语义理解可能相当有信息量。受此启发,我们提出了一个统一框架,利用细粒度信息进行零样本视觉-语言学习,涵盖多个任务,如VQA、SNLI-VE和VCR。我们的实验表明,我们的框架在VQA上优于先前的零样本方法,并在SNLI-VE和VCR上取得了显著改进。此外,我们的消融研究证实了我们提出的方法的有效性和泛化性。

英文摘要

Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for vision-language tasks have been well-studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantics understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.

2503.10629 2026-06-04 cs.CV

Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology

层次化自监督对抗训练用于组织病理学中的鲁棒视觉模型

Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Khalifa University(哈利法大学) Linköping University(林雪平大学) Australian National University(澳大利亚国立大学)

AI总结 提出层次化自监督对抗训练(HSAT),利用组织病理图像的患者-切片-补丁层次结构进行多级对比学习,生成对抗样本并整合到对抗训练中,在OpenSRH数据集上白盒设置平均提升54.31%,黑盒设置性能下降降至3-4%。

Comments Accepted at 28th International Conference On Medical Image Computing And Computer Assisted Intervention (MICCAI 2025)

详情
AI中文摘要

对抗攻击对医疗等关键领域的视觉模型构成重大挑战,这些领域可靠性至关重要。尽管对抗训练在自然图像中已得到充分研究,但其在生物医学和显微镜数据中的应用仍然有限。现有的自监督对抗训练方法忽视了组织病理图像的层次结构,其中患者-切片-补丁关系提供了有价值的判别信号。为了解决这一问题,我们提出了层次化自监督对抗训练(HSAT),它利用这些属性通过多级对比学习生成对抗样本,并将其整合到对抗训练中以增强鲁棒性。我们在多类组织病理数据集OpenSRH上评估了HSAT,结果表明HSAT在生物医学和自然图像领域均优于现有方法。HSAT增强了鲁棒性,在白盒设置中平均提升54.31%,在黑盒设置中将性能下降降至3-4%,而基线为25-30%。这些结果为该领域的对抗训练树立了新的基准,为更鲁棒的模型铺平了道路。我们的训练和评估代码可在https://github.com/HashmatShadab/HSAT获取。

英文摘要

Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrates them into adversarial training for enhanced robustness. We evaluate HSAT on the multiclass histopathology dataset OpenSRH, and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our code for training and evaluation is available at https://github.com/HashmatShadab/HSAT.

2407.03884 2026-06-04 cs.CL cs.AI

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

ChatSOP: 一种SOP引导的MCTS规划框架,用于可控的LLM对话代理

Zhigen Li, Jianxiang Peng, Yanmeng Wang, Yong Cao, Tianhao Shen, Minghui Zhang, Linxi Su, Shang Wu, Yihang Wu, Yuqian Wang, Ye Wang, Wei Hu, Jianfeng Li, Shaojun Wang, Jing Xiao, Deyi Xiong

发表机构 * TJUNLP Lab, College of Intelligence and Computing, Tianjin University(天津大学智能计算学院TJUNLP实验室) Ping An Technology(平安科技) Tübingen AI Center, University of Tübingen(图宾根大学图宾根人工智能中心) Kunming University of Science and Technology(昆明理工大学)

AI总结 提出ChatSOP框架,通过SOP引导的蒙特卡洛树搜索增强LLM对话代理的可控性,在动作准确率上相比GPT-3.5基线提升27.95%。

Comments Accepted to ACL 2025 main

Journal ref Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17637-17659, 2025

详情
AI中文摘要

由大型语言模型驱动的对话代理在各种任务中表现出优越的性能。尽管它们能更好地理解用户并生成类人回复,但缺乏可控性仍然是一个关键挑战,常常导致对话偏离主题或任务失败。为了解决这个问题,我们引入标准操作程序来规范对话流程。具体来说,我们提出了ChatSOP,一种新颖的SOP引导的蒙特卡洛树搜索规划框架,旨在增强LLM驱动的对话代理的可控性。为此,我们整理了一个数据集,包含使用GPT-4o的半自动角色扮演系统生成的、经过严格人工质量控制验证的SOP标注的多场景对话。此外,我们提出了一种新方法,将思维链推理与监督微调相结合用于SOP预测,并利用SOP引导的蒙特卡洛树搜索在对话中进行最优动作规划。实验结果表明了我们方法的有效性,例如,与基于GPT-3.5的基线模型相比,动作准确率提高了27.95%,并且在开源模型上也显示出显著的提升。数据集和代码已公开。

英文摘要

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding and human-like responses, their lack of controllability remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedures (SOPs) to regulate dialogue flow. Specifically, we propose ChatSOP, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method, such as achieving a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also showing notable gains for open-source models. The dataset and code are publicly available.

2502.08870 2026-06-04 cs.LG stat.ML

When and why randomised exploration works (in linear bandits)

随机探索何时以及为何有效(在线性赌博机中)

Marc Abeille, David Janz, Ciara Pike-Burke

发表机构 * Criteo AI Lab(Criteo AI实验室) University of Oxford(牛津大学) Imperial College London(伦敦帝国理工学院)

AI总结 本文提出一种不依赖强制乐观或后验膨胀的分析方法,证明在动作空间光滑且强凸的d维线性赌博机中,随机探索算法(如汤普森采样)可实现O(d√n log(n))的n步遗憾界,首次表明在非平凡线性赌博机设置中汤普森采样能达到最优维度依赖。

Comments Minor corrections to formulas and text; results unchanged

详情
AI中文摘要

我们提供了一种分析随机探索算法(如汤普森采样)的方法,该方法不依赖于强制乐观或后验膨胀。通过这种方法,我们证明在$d$维线性赌博机设置中,当动作空间光滑且强凸时,随机探索算法享有$O(d\sqrt{n} \log(n))$阶的$n$步遗憾界。值得注意的是,这首次表明存在非平凡的线性赌博机设置,其中汤普森采样可以在遗憾中实现最优维度依赖。

英文摘要

We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n} \log(n))$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.
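
For readers unfamiliar with the quantity being bounded, the $n$-step regret can be written out as below. This is the standard textbook definition for linear bandits, given here as an assumption about the setting rather than as the paper's exact formulation; the symbols $\mathcal{A}$ (action set), $\theta_\star$ (unknown parameter) and $A_t$ (action played in round $t$) do not appear in the abstract.

```latex
% Assumed standard notation: \mathcal{A} is the action set, \theta_\star the
% unknown parameter, A_t the action chosen in round t.
R_n \;=\; \sum_{t=1}^{n} \Big( \max_{a \in \mathcal{A}} \langle a, \theta_\star \rangle
      \;-\; \langle A_t, \theta_\star \rangle \Big),
\qquad
R_n \;=\; O\!\big( d \sqrt{n}\, \log(n) \big).
```

The second expression restates the bound from the abstract: it is linear in the dimension $d$, which is the "optimal dimension dependence" the abstract refers to.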

2408.01382 2026-06-04 cs.LG cs.GT

Explaining a probabilistic prediction on the simplex with Shapley compositions

用Shapley组合解释单纯形上的概率预测

Paul-Gauthier Noé, Miquel Perelló-Nieto, Jean-François Bonastre, Peter Flach

发表机构 * Laboratoire Informatique d’Avignon, Avignon Université, France(阿维尼昂信息实验室,阿维尼昂大学,法国) University of Bristol, United Kingdom(布里斯托大学,英国)

AI总结 本文引入Shapley组合,利用成分数据分析的Aitchison几何,为多类概率预测提供了一种基于公理的解释方法。

Comments Published in ECAI2024's proceedings

详情
AI中文摘要

源于博弈论的Shapley值被广泛用于通过量化每个特征值对预测的贡献来解释机器学习模型的预测。这需要像二分类中那样的标量预测,而多类概率预测是离散概率分布,位于多维单纯形上。在这种多类设置中,Shapley值通常以一对多的方式单独计算每个类别,忽略了输出分布的组成性质。在本文中,我们引入Shapley组合作为一种有根据的方法来正确解释多类概率预测,使用成分数据分析中的Aitchison几何。我们证明了Shapley组合是满足Aitchison单纯形上的线性性、对称性和效率的唯一量,扩展了标准Shapley值的相应公理性质。我们在一系列场景中展示了这种正确的多类处理。

英文摘要

Originating in game theory, Shapley values are widely used for explaining a machine learning model's prediction by quantifying the contribution of each feature's value to the prediction. This requires a scalar prediction as in binary classification, whereas a multiclass probabilistic prediction is a discrete probability distribution, living on a multidimensional simplex. In such a multiclass setting the Shapley values are typically computed separately on each class in a one-vs-rest manner, ignoring the compositional nature of the output distribution. In this paper, we introduce Shapley compositions as a well-founded way to properly explain a multiclass probabilistic prediction, using the Aitchison geometry from compositional data analysis. We prove that the Shapley composition is the unique quantity satisfying linearity, symmetry and efficiency on the Aitchison simplex, extending the corresponding axiomatic properties of the standard Shapley value. We demonstrate this proper multiclass treatment in a range of scenarios.
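
The Aitchison geometry mentioned above replaces ordinary addition and scaling with operations that keep a vector on the probability simplex. The sketch below shows the standard operations from compositional data analysis (closure, perturbation, powering, and the centred log-ratio transform) in plain Python. It is a toy illustration only: the two "contribution" compositions are made-up values, not the output of the paper's method, and the way they are combined is only meant to show what an efficiency-style decomposition on the simplex could look like.

```python
import math


def closure(x):
    """Rescale positive values so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]


def perturb(x, y):
    """Aitchison 'addition': component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(alpha, x):
    """Aitchison 'scalar multiplication': component-wise power, then closure."""
    return closure([v ** alpha for v in x])


def clr(x):
    """Centred log-ratio transform; the result always sums to zero."""
    log_gm = sum(math.log(v) for v in x) / len(x)
    return [math.log(v) - log_gm for v in x]


# Hypothetical three-class example: a uniform base distribution shifted by
# two made-up feature contributions (each one is itself a composition).
base = closure([1.0, 1.0, 1.0])
contrib_a = closure([2.0, 1.0, 1.0])
contrib_b = closure([1.0, 3.0, 1.0])
prediction = perturb(perturb(base, contrib_a), contrib_b)

print(prediction)       # still a probability distribution
print(clr(prediction))  # coordinates in log-ratio space
```

The uniform composition acts as the neutral element of perturbation, so a feature with no effect on the prediction would correspond to a uniform contribution.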

2408.11121 2026-06-04 cs.LG cs.AI cs.CL cs.CR

DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

DOMBA: 通过最小有界聚合实现访问控制语言模型的双模型平衡

Tom Segal, Asaf Shabtai, Yuval Elovici

发表机构 * Ben-Gurion University(本·古里安大学)

AI总结 提出DOMBA方法,通过最小有界平均函数聚合两个不同访问级别文档训练的语言模型的概率分布,在保证安全性的同时实现高效用。

Comments Code: https://github.com/ppo1/DOMBA 11 pages, 3 figures

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25101-25109, 2025

详情
AI中文摘要

大型语言模型(LLMs)的实用性在很大程度上取决于其训练数据的质量和数量。许多组织拥有大量数据语料库,可用于训练或微调针对其特定需求的LLMs。然而,这些数据集通常带有基于用户权限并由访问控制机制强制执行的访问限制。在此类数据集上训练LLMs可能导致敏感信息暴露给未经授权的用户。防止此类暴露的一种直接方法是为每个访问级别训练一个单独的模型。然而,由于每个模型的训练数据量相对于整个组织语料库的总量有限,这可能导致模型效用低下。另一种方法是在所有数据上训练单个LLM,同时限制未经授权信息的暴露。然而,当前针对LLMs的暴露限制方法对于访问控制数据无效,因为敏感信息在多个训练样本中频繁出现。我们提出DOMBA——双模型平衡——一种训练和部署LLMs的简单方法,可在提供高效用和访问控制功能的同时保证安全性。DOMBA使用“最小有界”平均函数(一个受较小值约束的函数,例如调和平均)聚合两个模型的概率分布,每个模型在具有(可能多个)不同访问级别的文档上训练。详细的数学分析和广泛评估表明,DOMBA在保护受限信息的同时,提供了与非安全模型相当的效用。

英文摘要

The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a "min-bounded" average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
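
To make the "min-bounded" idea concrete, the sketch below aggregates two next-token distributions with a harmonic mean, the example the abstract itself names. Everything else is an assumption made for illustration: the function names, the toy vocabularies, and the final renormalisation step are not taken from the paper, and the real method may differ in how it normalises or trains the two models.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two probabilities; never exceeds 2 * min(a, b)."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


def aggregate(p, q):
    """Combine two distributions over the same tokens, then renormalise.

    Renormalising is an assumption of this sketch, and it presumes the two
    distributions share at least one token with non-zero probability.
    """
    raw = {tok: harmonic_mean(p[tok], q[tok]) for tok in p}
    total = sum(raw.values())
    return {tok: v / total for tok, v in raw.items()}


# Hypothetical next-token probabilities from two sub-models. Only model_a has
# seen the restricted documents, so only it rates the token "secret" highly.
model_a = {"public": 0.5, "secret": 0.45, "other": 0.05}
model_b = {"public": 0.6, "secret": 0.001, "other": 0.399}

combined = aggregate(model_a, model_b)
print(combined)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a token that only one model considers likely ends up with a low combined probability, which is the intuition behind limiting exposure of information known to a single sub-model.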

2502.01576 2026-06-04 cs.CV

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Robust-LLaVA:大规模鲁棒图像编码器对多模态大语言模型的有效性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学) Khalifa University(哈利法大学) Michigan State University(密歇根州立大学) Australian National University(澳大利亚国立大学)

AI总结 本文提出利用大规模对抗预训练的图像分类模型替代CLIP编码器,以增强多模态大语言模型对视觉对抗扰动的鲁棒性,在无需额外对抗训练的情况下,在视觉问答、图像描述和越狱攻击任务中取得显著鲁棒性提升。

Comments Accepted at Trustworthy FMs Workshop Trust Before Use: Building Foundation Models that You Can Trust (ICCVW) 2025

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务中表现出色,但仍然容易受到视觉对抗扰动的影响,这些扰动可能引发幻觉、操纵响应或绕过安全机制。现有方法通过在ImageNet规模的数据上对CLIP视觉编码器进行受限的对抗微调来缓解这些风险,从而保持其泛化能力。然而,这种有限的对抗训练限制了鲁棒性和更广泛的泛化。在这项工作中,我们探索了一种替代方法,即利用在大规模数据上经过对抗预训练的现有视觉分类模型。我们的分析揭示了两个主要贡献:(1)对抗预训练的广泛规模和多样性使得这些模型能够对各种对抗威胁表现出优越的鲁棒性,范围从不可察觉的扰动到高级越狱尝试,而无需额外的对抗训练;(2)将这些鲁棒模型与MLLM进行端到端集成,有助于语言组件更好地适应鲁棒视觉特征,在复杂推理任务上优于现有的即插即用方法。通过在视觉问答、图像描述和越狱攻击上的系统评估,我们证明使用这些鲁棒模型训练的MLLM在保持良好干净性能的同时,实现了优越的对抗鲁棒性。我们的框架在描述和VQA任务中分别实现了2倍和1.5倍的平均鲁棒性增益,并在越狱攻击中提供了超过10%的改进。代码和预训练模型将在https://github.com/HashmatShadab/Robust-LLaVA 提供。

英文摘要

Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.

2412.20803 2026-06-04 cs.CV

Scalable Event Cloud Network for Event-based Classification

可扩展事件云网络用于基于事件的分类

Hongwei Ren, Fei Ma, Xiaopeng Lin, Yuetong Fang, Hongxiang Huang, Yue Zhou, Yulong Huang, Haotian Fu, Ziyi Yang, Youxin Jiang, Xiangqian Wu, Bojun Cheng

发表机构 * Research Centre for Multimodal Artificial Intelligence(多模态人工智能研究中心) Applications, Faculty of Computing, Harbin Institute of Technology(应用学院,哈尔滨工业大学) MICS Thrust, Hong Kong University of Science and Technology(香港科技大学MICS学域) Guangdong Laboratory of Artificial Intelligence and Digital Economy(广东人工智能与数字经济实验室)

AI总结 提出SECNet,通过结构级极性集成和频域特征提取,解决事件云表示在空间和时间分辨率上的可扩展性问题,在十个数据集上验证了有效性。

Comments ICML2026 Oral

详情
AI中文摘要

事件相机是受生物启发的传感器,引起了工业界和学术界的广泛关注。主流方法倾向于帧和体素表示,这些方法在达到满意性能的同时,引入了耗时的转换、庞大的模型,并牺牲了细粒度的时间信息。相比之下,点云表示在解决上述弱点方面显示出潜力,但在抽象更高空间分辨率和更长时序事件的特征方面可扩展性有限。在本文中,我们提出了一种名为SECNet的可扩展网络,以利用事件云表示。SECNet通过创新的基于事件的分组和采样模块,在结构层面而非仅在输入层面集成极性。为了适应事件数量的激增,SECNet通过傅里叶变换在频域中进行特征提取。这种方法不仅显著减少了乘累加操作的爆炸,而且有效地抽象了时空特征。我们在十个基于事件的数据集上进行了大量实验,验证了SECNet的可扩展性、有效性和效率。我们的代码将在以下网址提供:https://github.com/rhwxmx/SECNet_ICML。

英文摘要

Event cameras are biologically inspired sensors garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach satisfactory performance but introduce time-consuming transformations and bulky models and sacrifice fine-grained temporal information. Alternatively, point cloud representation shows promise in addressing these weaknesses, but it has limited scalability in abstracting features of events with higher spatial resolution and longer temporal sequences. In this paper, we propose a scalable network named SECNet to leverage Event Cloud representation. SECNet integrates polarity at the structural level, rather than only at the input level, through a new Event-based Group and Sampling module. To accommodate the surge in the number of events, SECNet performs feature extraction in the frequency domain via the Fourier transform. This approach not only substantially curbs the explosion of Multiply Accumulate Operations but also effectively abstracts spatio-temporal features. We conducted extensive experiments on ten event-based datasets, which substantiate the scalability, effectiveness, and efficiency of SECNet. Our code will be available at: https://github.com/rhwxmx/SECNet_ICML.

2412.06095 2026-06-04 cs.CL cs.FL cs.IT math.IT

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance

从小语料库测量语法多样性:派生熵率、平均话语长度和注释不变性

Fermin Moscoso del Prado Martin

发表机构 * Department of Computer Science and Technology & Jesus College University of Cambridge(计算机科学与技术系及耶稣学院,剑桥大学)

AI总结 本文从理论和实证上证明语法的派生熵与其生成的话语平均长度(MLU)之间存在根本联系,提出派生熵率作为衡量语法复杂性的新指标,并引入平滑诱导树库熵(SITE)从小树库中准确估计这些度量。

Journal ref Computational Linguistics (2025) 51 (4): 1191-1233

详情
AI中文摘要

在许多领域,如语言习得、语言神经心理学、衰老研究和历史语言学,语料库被用于估计个体、社区或说话者类型在一段时间内产生的语法结构的多样性。在这些情况下,树库被视为可能遇到的句法结构的代表性样本。从小型语料库中记录的结构推广潜在的句法多样性需要谨慎的外推,其准确性受到代表性子语料库规模有限的制约。在本文中,我从理论和实证上证明,语法的派生熵与其生成的话语平均长度(MLU)之间存在根本联系,从而产生了一个新的度量——派生熵率。话语平均长度成为句法复杂性最实用的指标;我证明MLU不仅仅是一个代理,而是语法多样性的基本度量。结合新的派生熵率度量,它提供了一种无理论的语法复杂性评估。派生熵率索引了不同语法注释框架确定树库语法复杂性的速率。我引入了平滑诱导树库熵(SITE)作为准确估计这些度量的工具,即使从非常小的树库中也能做到。最后,我讨论了这些结果对自然语言处理和人类语言处理的重要启示。

英文摘要

In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate -- theoretically, and empirically -- that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.

2411.19758 2026-06-04 cs.CV cs.AI cs.LG

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

LaVIDE: 通过地图-图像对齐的语言提示卫星变化检测

Shuguo Jiang, Fang Xu, Chuandong Liu, Hong Tan, Shengyang Li, Lei Yu, Wen Yang, Sen Jia, Gui-Song Xia

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Artificial Intelligence, Wuhan University(武汉大学人工智能学院) Technology and Engineering Center for Space Utilization and the Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心及空间应用重点实验室) School of Aeronautics and Astronautics, University of Chinese Academy of Sciences(中国科学院大学航空宇航学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院)

AI总结 提出LaVIDE框架,利用受限提示学习和对象感知嵌入增强,通过语言弥合高层地图类别与低层图像细节之间的语义鸿沟,实现跨模态对齐,在多类与单类变化检测任务上分别提升IoU 18.4%和5.2%。

详情
AI中文摘要

基于地图参考和最新图像的遥感变化检测,在缺乏早期图像进行比较时,有助于及时观测地球表面。然而,高层地图类别与低层图像细节之间的语义鸿沟阻碍了提取同质特征以进行稳健的时间关联。与比较像素级视觉相似性或传播分割误差的传统方法不同,我们提出了一种新颖框架——LaVIDE(用于检测变化的语言-视觉判别器),该框架以语言为中介,弥合了高层地图类别与低层图像细节之间的语义鸿沟。具体来说,我们引入了受限提示学习来生成上下文感知的文本提示,使地图语义与图像内容对齐,并采用对象感知嵌入增强策略将对象级属性(如形状、边界)整合到地图表示中。这些组件能够在统一的语言-视觉特征空间中实现稳健的跨模态对齐。在四个基准数据集(DynamicEarthNet、HRSCD、BANDON和SECOND)上的大量实验表明,LaVIDE以显著优势超越了最先进的方法,在多类和单类变化检测任务上分别实现了18.4%和5.2%的IoU提升。我们的框架不仅提高了地图-图像变化检测的准确性,还为以最少人工干预快速更新地图提供了实用解决方案,有望在城市规划、灾害评估和生态保护等领域产生广泛影响。代码和数据集可在 https://github.com/ShuGuoJ/LAVIDE.git 获取。

英文摘要

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, Language-VIsion Discriminator for dEtecting changes (LaVIDE), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.

2409.11901 2026-06-04 cs.CL

LLMs + Persona-Plug = Personalized LLMs

LLMs + Persona-Plug = 个性化大语言模型

Jiongnan Liu, Yutao Zhu, Shuting Wang, Xiaochi Wei, Erxue Min, Yu Lu, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE(下一代智能搜索与推荐工程研究中心,教育部) Baidu Inc.(百度公司)

AI总结 提出基于轻量级插件式用户嵌入模块的个性化模型PPlug(Persona-Plug),通过建模用户历史上下文生成个性化嵌入,无需微调即可提升大语言模型输出个性化程度。

详情
AI中文摘要

个性化在众多语言任务和应用中扮演着关键角色,因为具有相同需求的用户可能根据个人兴趣偏好不同的输出。这促使了各种个性化方法的发展,旨在使大语言模型(LLMs)适应生成符合用户偏好的定制化输出。其中一些方法涉及为每个用户微调一个独特的个性化LLM,这过于昂贵而难以广泛应用。替代方法以即插即用的方式引入个性化信息,通过检索用户相关历史文本作为示例。然而,这种基于检索的策略可能会破坏用户历史的连续性,无法捕捉用户的整体风格和模式,从而导致次优性能。为了解决这些挑战,我们提出了一种新颖的个性化LLM模型PPlug。它通过一个轻量级的插件式用户嵌入模块,对每个用户的所有历史上下文进行建模,构建用户特定的嵌入。通过将该嵌入附加到任务输入中,LLMs可以更好地理解和捕捉用户的习惯与偏好,从而在不调整自身参数的情况下生成更个性化的输出。在语言模型个性化(LaMP)基准上的各种任务上的大量实验表明,所提出的模型显著优于现有的个性化LLM方法。

英文摘要

Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, PPlug. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.

2406.09407 2026-06-04 cs.CV

Towards Evaluating the Robustness of Visual State Space Models

评估视觉状态空间模型的鲁棒性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学) Center of Secure Cyber-Physical Security Systems(安全的网络物理安全系统中心) Linköping University(林雪平大学) Australian National University(澳大利亚国立大学)

AI总结 本文全面评估了视觉状态空间模型(VSSMs)在遮挡、图像结构、常见损坏和对抗攻击等多种扰动下的鲁棒性,并与Transformer和CNN等架构进行比较,揭示了其优势和局限性。

Comments Accepted at The 5th Workshop of Adversarial Machine Learning on Computer Vision (CVPRW 2025)

详情
AI中文摘要

视觉状态空间模型(VSSMs)是一种结合了循环神经网络和潜变量模型优势的新型架构,通过有效捕捉长程依赖和建模复杂视觉动态,在视觉感知任务中表现出色。然而,它们在自然和对抗扰动下的鲁棒性仍然是一个关键问题。在这项工作中,我们全面评估了VSSMs在各种扰动场景下的鲁棒性,包括遮挡、图像结构、常见损坏和对抗攻击,并将其性能与Transformer和卷积神经网络等成熟架构进行比较。此外,我们研究了VSSMs在复杂视觉场景中针对物体-背景组合变化的鲁棒性,使用了专门设计用于测试模型性能的复杂基准。我们还使用模拟真实场景的损坏数据集评估了它们在目标检测和分割任务上的鲁棒性。为了更深入地理解VSSMs的对抗鲁棒性,我们进行了基于频率的对抗攻击分析,评估了它们对低频和高频扰动的性能。我们的发现突出了VSSMs在处理复杂视觉损坏方面的优势和局限性,为未来研究提供了宝贵的见解。我们的代码和模型将在 https://github.com/HashmatShadab/MambaRobustness 提供。

英文摘要

Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.

2407.13922 2026-06-04 cs.CV cs.AI cs.LG

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

CounterFace: 用于人脸识别系统细粒度反事实评估的合成人脸数据集

Guruprasad Viswanathan Ramesh, Ashish Hooda, Shimaa Ahmed, Harrison J Rosenberg, Ramya Korlakai Vinayak, Kassem Fawaz

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Visa Research(Visa研究)

AI总结 提出CounterFace数据集,通过全自动流水线生成包含20种面部属性和8种人口统计因素的11,821个反事实人脸对,用于细粒度评估人脸识别系统在特定属性-人口统计组合下的性能退化。

Comments Code available at https://github.com/Guruprasad68/counterface_facct2026. Dataset available for non-commercial research upon request

详情
English abstract

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and lighting. However, human faces undergo more fine-grained changes, including appearance changes such as hairstyles and makeup, that are underrepresented in existing benchmarks. Counterfactual evaluation provides a method to assess FR robustness under such fine-grained variations. Existing counterfactual face datasets synthesized with image generators, however, are limited in attribute coverage because their pipelines rely on human verification. We propose CounterFace, a new counterfactual evaluation dataset comprising 20 facial attributes and 8 demographic factors, exceeding prior synthetic face datasets by 14 attributes and 2 demographics. The dataset is generated using a fully automated pipeline based on off-the-shelf image generators with custom verifiers, removing the need for human verification. CounterFace contains 11,821 counterfactual face pairs, and a post-hoc user study confirms the faithfulness of the generated counterfactuals. We evaluate two commercial and four open-source FR systems (AWS Rekognition, Face++, AdaFace, MagFace, ArcFace, FaceNet) across 160 attribute-demographic combinations. Unlike standard evaluation benchmarks, our dataset helps isolate precise failure modes of individual systems. Results indicate that performance degradation varies across attributes and demographics for all six systems, and that occluding attributes (e.g., facemask and facial hair) universally degrade performance.
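
The per-combination evaluation described above can be pictured with a small sketch. It assumes each counterfactual pair has already been scored by an FR system, and it groups match outcomes by (attribute, demographic) so that weak combinations stand out. The record layout, the threshold of 0.5, and the attribute and demographic names are hypothetical placeholders, not CounterFace's schema or the paper's protocol.

```python
from collections import defaultdict

def match_rate_by_combination(scored_pairs, threshold):
    """Return {(attribute, demographic): fraction of pairs still matched}.

    scored_pairs: iterable of (attribute, demographic, similarity) tuples, where
    similarity is the FR system's score for one original/counterfactual pair.
    """
    matched = defaultdict(int)
    total = defaultdict(int)
    for attribute, demographic, similarity in scored_pairs:
        key = (attribute, demographic)
        total[key] += 1
        if similarity >= threshold:
            matched[key] += 1
    return {key: matched[key] / total[key] for key in total}

# Hypothetical scores for illustration only (not results from the paper).
scored_pairs = [
    ("facemask", "group_a", 0.31),
    ("facemask", "group_a", 0.62),
    ("hairstyle", "group_a", 0.88),
    ("hairstyle", "group_b", 0.91),
]
rates = match_rate_by_combination(scored_pairs, threshold=0.5)
```

With 20 attributes and 8 demographic factors, a full grid would have 20 × 8 = 160 keys, which is consistent with the number of combinations reported in the abstract.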

2404.11309 2026-06-04 cs.CV

Achieving Rotation-Invariant Convolution via Non-Learnable Orientation Alignment Operators

Hanlin Mo, Peihong Lei, You Hao, Guoying Zhao

Affiliations * University of Science and Technology of China

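AI summary Proposes rotation-invariant convolutions (RIConvs) built on non-learnable operators. They have the same number of learnable parameters as standard convolutions and a similar computational process, and they improve accuracy across several vision tasks, especially when training data is limited.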

Details
English abstract

Achieving rotational invariance in deep neural networks without data augmentation is an active research topic. Intrinsic invariance enables features to capture targets' inherent properties, enhancing deep learning performance in visual tasks. Based on various types of non-learnable operators, this paper proposes a comprehensive set of convolution operations that are naturally invariant to arbitrary rotations. Unlike most prior methods, these rotation-invariant convolutions (RIConvs) have the same number of learnable parameters as standard convolutions and follow a similar computational process, making the two interchangeable. Using the MNIST-Rot dataset, we validate their invariance across rotation angles and compare them with previous rotation-invariant CNNs, where two gradient-based RIConvs achieve state-of-the-art results. Then, we integrate RIConvs with classic CNN backbones and evaluate them on texture recognition, aircraft type recognition, and remote sensing image classification tasks. Results show that RIConvs significantly improve accuracy, particularly with limited training data, and enhance performance even with data augmentation.
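
To make the idea of a non-learnable alignment step concrete, here is a toy sketch. It is not one of the paper's operators: it simply rotates the ring of 8 neighbours in a 3x3 patch so that the largest value comes first, then applies ordinary weights. The alignment rule (largest neighbour), the function names, and the sample values are all assumptions, and the invariance shown holds only for 90-degree grid rotations of patches with a unique maximum, not for arbitrary angles.

```python
# Clockwise ring of the 8 neighbours around the centre of a 3x3 patch.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def aligned_response(patch, ring_weights, center_weight):
    """Filter response after rotating the neighbour ring to a canonical start."""
    ring = [patch[r][c] for r, c in RING]
    start = ring.index(max(ring))          # alignment step with no learnable parameters
    aligned = ring[start:] + ring[:start]  # cyclic shift so the maximum comes first
    response = center_weight * patch[1][1]
    response += sum(w * v for w, v in zip(ring_weights, aligned))
    return response

def rotate90(patch):
    """Rotate a square patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]

# Hypothetical patch and weights for illustration.
patch = [[1, 2, 3], [8, 0, 4], [7, 6, 5]]
weights = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.5, 1.0]
original = aligned_response(patch, weights, center_weight=3.0)
rotated = aligned_response(rotate90(patch), weights, center_weight=3.0)
```

Because the alignment only reorders inputs, the learnable weights stay the same in number as in a plain 3x3 filter, which mirrors the interchangeability property the abstract emphasizes.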

1905.04235 2026-06-04 cs.RO cs.SY eess.SY

Autonomous Locomotion Mode Transition in Quadruped Track-Legged Robots: A Simulation-Based Analysis for Step Negotiation

Jie Wang, Krispin Davies

Affiliations * University of Cambridge, ClearPath AI

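AI summary Presents a method for autonomous locomotion mode transition in a quadruped hybrid robot, particularly when negotiating steps of different heights, using an energy-efficiency evaluation mechanism to achieve smooth transitions.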

Details
English abstract

Hybrid track/wheel-legged robots combine the advantages of wheel-based and leg-based locomotion, granting adaptability across varied terrains through efficient transitions between rolling and walking modes. However, automating these transitions remains a significant challenge. In this paper, we introduce a method designed for autonomous mode transition in a quadruped hybrid robot with a track/wheel-legged configuration, especially during step negotiation. Our approach hinges on a decision-making mechanism that evaluates the energy efficiency of both locomotion modes using a proposed energy-based criterion. To guarantee a smooth negotiation of steps, we incorporate two climbing gaits designated for the assessment of energy usage in walking locomotion. Simulation results validate the method's effectiveness, showing successful autonomous transitions across steps of diverse heights. Our suggested approach has universal applicability and can be modified to suit other hybrid robots of similar mechanical configuration, provided their locomotion energy performance is studied beforehand.
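
An energy-based decision between the two modes could look roughly like the sketch below. It integrates sampled power to estimate the energy each mode would spend on a step and picks the cheaper one. This is the simplest criterion one could write down, chosen for illustration; the abstract does not give the paper's actual criterion, and the function names, power traces, and sampling interval are hypothetical.

```python
def energy_joules(power_watts, dt_seconds):
    """Trapezoidal integral of evenly sampled power over time."""
    return sum((a + b) * 0.5 * dt_seconds
               for a, b in zip(power_watts, power_watts[1:]))

def select_mode(rolling_power, walking_power, dt_seconds):
    """Pick the locomotion mode with the lower estimated energy for a step."""
    rolling = energy_joules(rolling_power, dt_seconds)
    walking = energy_joules(walking_power, dt_seconds)
    if rolling <= walking:
        return ("rolling", rolling)
    return ("walking", walking)

# Hypothetical power traces (watts) sampled every 0.5 s while negotiating one step.
rolling_power = [10, 40, 80, 40]
walking_power = [30, 30, 30, 30]
mode, energy = select_mode(rolling_power, walking_power, dt_seconds=0.5)
```

In this made-up case rolling costs 72.5 J against 45 J for walking, so the sketch selects walking. In practice the walking estimate would come from the climbing gaits studied beforehand, as the abstract notes.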