arXivDaily: daily arXiv academic digest, updated Monday through Friday
2601.06431 2026-05-29 cs.AI

LsrIF: Enhancing Logic-Structured Instruction Following of Large Language Models

LsrIF: 增强大语言模型的逻辑结构化指令遵循能力

Qingyu Ren, Qianyu He, Jingwen Chang, Geng Zhang, Jiajie Zhu, Xingzhou Chen, Zhuofei Shi, Jiaqing Liang, Yanghua Xiao, Han Xia, Zeye Sun, Fei Yu

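AI summary Proposes the LsrIF framework, which builds training data from atomic constraints arranged in parallel, sequential, conditional, and nested structures and uses structure-aware reward aggregation to improve large language models on logic-structured instruction following.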

AI Chinese abstract

指令遵循对于大语言模型至关重要,然而现实世界中的指令通常涉及具有逻辑结构的多个约束,例如并行组合、顺序依赖和条件分支。现有方法通常通过简单组合约束来构建数据,并在训练过程中通过平均各个约束分数来聚合奖励,忽略了逻辑依赖关系并引入了噪声信号。我们提出LsrIF,一个用于逻辑结构化指令遵循的训练框架。LsrIF通过将原子约束组织成并行、顺序、条件和嵌套结构来构建数据,并应用与其执行语义一致的结构感知奖励聚合:对并行约束取平均奖励,在顺序结构中早期失败后衰减后续奖励,在条件结构中仅奖励活跃分支。实验表明,LsrIF在领域内和领域外设置中均提升了指令遵循能力,同时也有利于逻辑推理。进一步分析表明,逻辑结构化训练增加了对约束相关词元和逻辑连接词的注意力,表明模型对指令逻辑的建模得到改善。我们将发布我们的数据和代码以供未来研究。

English abstract

Instruction following is critical for large language models, yet real-world instructions often involve multiple constraints with logical structures, such as parallel composition, sequential dependencies, and conditional branching. Existing methods typically construct data by simply combining constraints and aggregate rewards by averaging individual constraint scores during training, overlooking logical dependencies and introducing noisy signals. We propose LsrIF, a training framework for logic-structured instruction following. LsrIF constructs data by organizing atomic constraints into parallel, sequential, conditional, and nested structures, and applies structure-aware reward aggregation aligned with their execution semantics: averaging rewards for parallel constraints, decaying later rewards after early failures in sequential structures, and rewarding only active branches in conditional structures. Experiments show that LsrIF improves instruction following in both in-domain and out-of-domain settings while also benefiting logic reasoning. Further analysis indicates that logic-structured training increases attention to constraint-related tokens and logical connectors, suggesting improved modeling of instruction logic. We will release our data and code for future research.
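
A minimal sketch of the structure-aware reward aggregation described in the abstract above, written as a recursive function over a constraint tree. The class names, the decay factor `DECAY`, the pass threshold, and the choice to average the decayed sequential rewards are assumptions made for illustration; the abstract does not give these details, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Atom:
    score: float  # reward of one atomic constraint, assumed to lie in [0, 1]


@dataclass
class Parallel:
    parts: List["Node"]


@dataclass
class Sequential:
    steps: List["Node"]


@dataclass
class Conditional:
    condition_holds: bool  # which branch is active for this instruction
    then_branch: "Node"
    else_branch: "Node"


Node = Union[Atom, Parallel, Sequential, Conditional]

DECAY = 0.5           # hypothetical decay applied after each failed step
PASS_THRESHOLD = 1.0  # hypothetical: a step "fails" if it scores below this


def aggregate(node: Node) -> float:
    """Aggregate constraint rewards according to the node's logical structure."""
    if isinstance(node, Atom):
        return node.score
    if isinstance(node, Parallel):
        # Parallel constraints: plain average of the parts.
        scores = [aggregate(p) for p in node.parts]
        return sum(scores) / len(scores)
    if isinstance(node, Sequential):
        # Sequential constraints: later rewards are decayed after an early failure.
        weight, total = 1.0, 0.0
        for step in node.steps:
            s = aggregate(step)
            total += weight * s
            if s < PASS_THRESHOLD:
                weight *= DECAY
        return total / len(node.steps)
    if isinstance(node, Conditional):
        # Conditional constraints: only the active branch is rewarded.
        active = node.then_branch if node.condition_holds else node.else_branch
        return aggregate(active)
    raise TypeError(f"unknown node type: {type(node).__name__}")
```

Under these assumptions, failing the first of two sequential steps scores 0.25, while failing only the second scores 0.5; plain averaging would score both cases 0.5.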

2601.04633 2026-05-29 cs.CL

MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

MAGA-Bench: 通过对齐检测基准的机器增强生成文本

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

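AI summary Proposes the MAGA benchmark, which integrates several alignment methods (from prompt construction to generator-detector adversarial reinforcement learning and the reasoning process) to make machine-generated text more human-aligned and to improve detector generalization.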

AI Chinese abstract

机器生成文本(MGT)越来越难以与人类书写文本(HWT)区分。这一趋势加剧了虚假新闻和在线欺诈等恶意活动。微调检测器的泛化能力严重依赖数据集质量,仅仅扩大MGT来源可能越来越不足,需要进一步增强生成过程。基于HC-Var理论,增强MGT的人类对齐性不仅有助于现有检测器的鲁棒性测试,还能提升在此类对齐MGT数据集上微调的检测器的泛化能力。因此,我们提出了Machine-Augment-Generated Text via Alignment (MAGA) 检测基准。MAGA集成了多种对齐方法,从提示构建到Generator-Detector Adversarial Reinforcement Learning (GDARL) 以及推理过程。在我们的实验中,在MAGA上微调的RoBERTa检测器在泛化AUC上平均提升4.60%。相反,MAGA中的对齐MGT也导致所选检测器的AUC平均下降8.13%。我们希望MAGA基准能为未来MGT检测器泛化能力的研究提供有价值的见解。

English abstract

Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This trend has exacerbated malicious activities such as fake news and online fraud. The generalization ability of fine-tuned detectors relies heavily on dataset quality, and simply expanding the sources of MGT may become increasingly insufficient. Further augmentation of the generation process is required. Based on HC-Var's theory, enhancing the human-like alignment of MGT not only facilitates robustness testing of existing detectors but also boosts the generalization ability of detectors fine-tuned on such aligned MGT datasets. Therefore, we propose the Machine-Augment-Generated Text via Alignment (MAGA) Detection Benchmark. MAGA integrates several alignment methods, ranging from prompt construction to Generator-Detector Adversarial Reinforcement Learning (GDARL) and the reasoning process. In our experiments, the RoBERTa detector fine-tuned on MAGA achieves an average improvement of 4.60% in generalization AUC. Conversely, the aligned MGTs in MAGA also lead to an average decrease of 8.13% in the AUC of selected detectors. We hope the MAGA Benchmark will provide valuable insights for future research on the generalization ability of MGT detectors.

2601.03134 2026-05-29 cs.CL

The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs

对话式诈骗的剖析:基于主题的LLM多轮交互红队分析

Xiangzhe Yuan, Zhenhao Zhang, Haoming Tang, Siying Hu

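AI summary Uses an LLM-to-LLM simulation framework to study adversarial dynamics in multi-turn social engineering dialogues, analyzes attack and defense strategies, and finds cross-model and cross-lingual differences in outcomes along with structural variation in strategy transitions.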

AI Chinese abstract

随着LLM通过扩展对话获得说服能力,它们为研究扩展交互环境中的对抗性对话行为创造了新的机会,而传统的单轮安全评估无法捕捉这些行为。我们使用受控的LLM到LLM模拟框架,系统地研究了这些交互动态,用于跨语言社交工程场景的自动红队测试。评估了八种最先进的英中模型,分析了对话级别结果,标注了攻击者和防御者策略家族,并建模了它们之间的交互动态。结果表明,多轮对抗性对话遵循反复出现的升级模式,而防御性响应通常依赖于验证、延迟和通道控制。我们进一步发现结果分布在统计上显著的跨模型和跨语言差异,转换分析揭示了防御者策略在不同语言中应对攻击者战术的系统性结构变化。这些发现强调了研究多轮对抗性对话设置中交互结构的重要性,并展示了受控的LLM到LLM模拟如何支持对抗性对话动态的机制分析。

English abstract

As LLMs gain persuasive capabilities through extended dialogues, they create new opportunities for studying adversarial conversational behavior in extended interaction settings that traditional single-turn safety evaluations fail to capture. We systematically study these interactional dynamics using a controlled LLM-to-LLM simulation framework for automated red-teaming across bilingual social engineering scenarios. Evaluating eight state-of-the-art models in English and Chinese, we analyze dialogue-level outcomes, annotate attacker and defender strategy families, and model interaction dynamics between them. Results show that multi-turn adversarial dialogues follow recurrent escalation patterns, while defensive responses frequently rely on verification, delay, and channel control. We further find statistically significant cross-model and cross-lingual differences in outcome distributions, and transition analysis reveals systematic structural variation in how defender strategies respond to attacker tactics across languages. These findings highlight the importance of studying interactional structure in multi-turn adversarial dialogue settings and demonstrate how controlled LLM-to-LLM simulations can support mechanistic analysis of adversarial conversational dynamics.

2601.02094 2026-05-29 cs.LG math.FA

Horizon Activation Mapping for Neural Networks in Time Series Forecasting

时间序列预测中神经网络的Horizon Activation Mapping

Krupakar Hans, V A Kandappan

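AI summary Proposes HAM, a visual interpretability technique based on gradient norm averages, for selecting and comparing time series forecasting models across different neural network families.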

Comments Accepted and Presented in International Conference on Optimization and Learning (OLA2026) for publication in LNCS

AI Chinese abstract

时间序列预测的神经网络依赖于误差度量和特定架构的可解释性方法进行模型选择,但这些方法不适用于不同家族的模型。为了解释对最先进模型家族中层类型不可知的预测模型,我们引入了Horizon Activation Mapping (HAM),这是一种受grad-CAM启发的视觉可解释性技术,使用梯度范数平均来研究horizon的子序列,而grad-CAM则研究图像数据上的注意力图。我们引入了因果和反因果模式,以计算每个时间步上子序列的梯度更新范数平均值,以及表示范数平均值均匀分布的等比例线。研究了优化景观相对于批次大小变化、早停、训练-验证-测试分割、架构选择、单变量预测和dropout的影响,并与HAM中的子序列性能相关联。有趣的是,基于批次大小的活动差异似乎表明每个epoch之间存在指数近似的可能性。本研究使用在ETTm2数据集上训练的多变量预测模型,包括基于MLP的CycleNet、N-Linear、N-HITS、基于自注意力的FEDformer、Pyraformer、基于SSM的SpaceTime和基于扩散的多分辨率DDPM,在不同horizon大小下生成HAM图。NHITS的神经逼近定理和SpaceTime的指数自回归活动被归因于其训练、验证和测试集上HAM图的趋势。总的来说,HAM可用于细粒度模型选择、验证集选择以及不同神经网络模型家族之间的比较。

English abstract

Neural networks for time series forecasting have relied on error metrics and architecture-specific interpretability approaches for model selection that don't apply across models of different families. To interpret forecasting models agnostic to the types of layers across state-of-the-art model families, we introduce Horizon Activation Mapping (HAM), a visual interpretability technique inspired by grad-CAM that uses gradient norm averages to study the horizon's subseries where grad-CAM studies attention maps over image data. We introduce causal and anti-causal modes to calculate gradient update norm averages across subseries at every timestep and lines of proportionality signifying uniform distributions of the norm averages. Optimization landscape studies with respect to changes in batch sizes, early stopping, train-val-test splits, architectural choices, univariate forecasting and dropouts are studied with respect to performances and subseries in HAM. Interestingly, batch size based differences in activities seem to indicate potential for existence of an exponential approximation across them per epoch relative to each other. Multivariate forecasting models including MLP-based CycleNet, N-Linear, N-HITS, self attention-based FEDformer, Pyraformer, SSM-based SpaceTime and diffusion-based Multi-Resolution DDPM over different horizon sizes trained over the ETTm2 dataset are used for HAM plots in this study. NHITS' neural approximation theorem and SpaceTime's exponential autoregressive activities have been attributed to trends in HAM plots over their training, validation and test sets. In general, HAM can be used for granular model selection, validation set choices and comparisons across different neural network model families.

2601.01162 2026-05-29 cs.LG cs.AI cs.CL

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

弥合分类数据聚类的语义鸿沟:基于大语言模型的方法

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

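AI summary Proposes the BREVE framework, which enriches categorical attribute values with semantic embeddings from an external knowledge base and introduces an adaptive weight to balance original value identity against semantic information, reaching an average ARI rank of 1.3 on eight benchmark datasets.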

Comments Accepted to ICPR2027

AI Chinese abstract

定性数据广泛存在于医疗、营销和生物信息学等领域,聚类是其中模式发现的基本工具。定性数据聚类的核心困难在于度量属性值之间的相似性,这些属性值没有固有的顺序或距离。为了恢复这种关系,现有研究通常依赖于数据集内的共现统计。然而,当样本量较小时,这种统计路径变得不可靠,每个值的语义上下文因此未被充分利用。受此限制,本文提出BREVE(通过外部值丰富实现平衡表示),一种聚类框架,通过从外部知识库中提取额外的语义维度来丰富每个定性值。即,每个唯一值被扩展为一个密集嵌入,编码其语义内容。为了防止原始值身份被添加的维度稀释,进一步附加一个轻量级的独热编码组件。然后,由聚类紧致性引导的自适应权重决定富集维度进入最终表示的强度。通过这种设计,在八个基准数据集上的实验表明,与七个代表性竞争者相比,平均ARI排名为1.3。

English abstract

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.

2512.24562 2026-05-29 cs.CL

HaluNet: Learning Hallucination Risk from Internal Signals in LLM Question Answering

HaluNet:从LLM问答内部信号学习幻觉风险

Chaodong Tong, Qi Zhang, Zhuojun Jiang, Lei Jiang, Yanbing Liu

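AI summary Proposes HaluNet, which estimates answer-level hallucination risk from the internal signals of a single generation (token likelihood, predictive entropy, and hidden states), is trained with LLM-as-a-Judge weak supervision, and markedly improves risk ranking and error detection on several QA datasets.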

Comments 16 pages, 12 tables, and 11 figures. This version includes a major revision of the manuscript and updates the author list with the consent of all involved authors

AI Chinese abstract

大型语言模型(LLM)在问答(QA)任务中表现出色,但可能生成流畅但缺乏证据支持的答案。现有的幻觉检测器通常依赖外部验证、重复采样或测试时评判调用,这对于实时问答来说成本高昂。我们提出HaluNet,一种轻量级幻觉风险估计器,利用单次模型生成的内部信号。HaluNet联合建模token似然、预测熵和隐藏状态信息,从而允许概率、分布和语义证据共同构成答案级风险评分。它使用LLM-as-a-Judge标签作为可扩展的弱监督进行训练,并通过独立的人工和多评判评估进行验证。在SQuAD、TriviaQA和Natural Questions上的实验表明,HaluNet在领域内和跨领域设置中均改进了答案级风险排序。在300个样本的人工评估中,HaluNet达到了0.874的AUROC和0.869的AUPRC;其前20%高风险答案包含96.5%的错误,相比基础错误率实现了2.06倍的提升。

English abstract

Large language models (LLMs) achieve strong question answering (QA) performance but can produce fluent answers unsupported by available evidence. Existing hallucination detectors often rely on external verification, repeated sampling, or test-time judge calls, which can be costly for real-time QA. We propose HaluNet, a lightweight hallucination risk estimator that uses internal signals from one model generation. HaluNet jointly models token likelihood, predictive entropy, and hidden-state information, allowing probabilistic, distributional, and semantic evidence to inform an answer-level risk score. It is trained with LLM-as-a-Judge labels as scalable weak supervision and evaluated with independent human and multi-judge assessments. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet improves answer-level risk ranking across in-domain and out-of-domain settings. On a 300-example human evaluation, HaluNet achieves 0.874 AUROC and 0.869 AUPRC; its top 20% highest-risk answers contain 96.5% errors, yielding a 2.06× lift over the base error rate.
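
The lift figure in the abstract above can be unpacked with a short calculation. This sketch assumes that "contain 96.5% errors" means 96.5% of the answers in the top-20% risk bucket are wrong, and that lift is that share divided by the overall error rate; both readings are assumptions, and the function names are hypothetical.

```python
def implied_base_error_rate(error_share_in_top: float, lift: float) -> float:
    """Overall error rate implied by lift = error_share_in_top / base_rate."""
    return error_share_in_top / lift


def share_of_all_errors_captured(top_fraction: float, lift: float) -> float:
    """Fraction of all errors that fall inside the top-risk bucket.

    captured = (top_fraction * error_share_in_top) / base_rate
             = top_fraction * lift
    """
    return top_fraction * lift


base_rate = implied_base_error_rate(0.965, 2.06)
captured = share_of_all_errors_captured(0.20, 2.06)
print(f"implied base error rate: {base_rate:.3f}")
print(f"share of all errors in the top 20%: {captured:.3f}")
```

Under this reading, the reported numbers imply that roughly 47% of all answers in the human-evaluated sample are errors, and that reviewing only the top 20% of answers by risk would surface about 41% of them.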

2512.21032 2026-05-29 cs.CV

Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

基于潜在扩散模型的多属性引导热人脸图像翻译

Mingshu Cai, Osamu Yoshie, Yuya Ieiri

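AI summary Proposes a multi-attribute guided method based on a latent diffusion model that generates high-quality visible-light face images from thermal infrared images while preserving key identity features, addressing domain shift and feature loss in heterogeneous face recognition.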

Comments Accepted by 2025 IEEE International Joint Conference on Biometrics (IJCB 2025)

Journal ref
2025 IEEE International Joint Conference on Biometrics (IJCB), 2025
AI Chinese abstract

现代监控系统越来越依赖多波长传感器和深度神经网络来识别夜间拍摄的红外图像中的人脸。然而,大多数人脸识别模型是在可见光数据集上训练的,由于显著的域偏移,在红外输入上性能大幅下降。早期的基于特征的红外人脸识别方法被证明效果不佳,促使研究人员采用生成式方法将红外图像转换为可见光图像以提高识别性能。这种被称为异质人脸识别(HFR)的范式面临模型和模态差异等挑战,导致生成图像出现失真和特征丢失。为了解决这些限制,本文引入了一种新颖的基于潜在扩散的模型,旨在从热输入生成高质量的可见光人脸图像,同时保留关键身份特征。我们集成一个多属性分类器,从可见光图像中提取关键面部属性,减轻红外到可见光图像恢复过程中的特征丢失。此外,我们提出了Self-attn Mamba模块,该模块增强了跨模态特征的全局建模,并显著提高了推理速度。在两个基准数据集上的实验结果表明,我们的方法在图像质量和身份保持方面均达到了最先进的性能。

English abstract

Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

2512.17220 2026-05-29 cs.CL

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

心智景观感知的检索增强生成以改进长上下文理解

Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu

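AI summary Proposes the MiA-RAG framework, which builds a mindscape through hierarchical summarization and conditions both retrieval and generation on it, so that long-context retrieval and reasoning are guided by a global semantic representation.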

AI Chinese abstract

人类通过依赖内容的整体语义表示来理解长而复杂的文本。这种全局视图有助于组织先验知识、解释新信息以及整合分散在文档中的证据,正如心理学中人类的心智景观感知能力所揭示的那样。当前的检索增强生成(RAG)系统缺乏这种指导,因此在长上下文任务中表现不佳。在本文中,我们提出了心智景观感知RAG(MiA-RAG),这是第一个将心智景观感知检索和生成表述为基于LLM的RAG的统一条件范式的框架。MiA-RAG通过层次化摘要构建心智景观,并将检索和生成都条件于这种全局语义表示。这使得检索器能够形成丰富的查询嵌入,生成器能够在连贯的全局上下文中对检索到的证据进行推理。我们在多样化的长上下文和双语基准上评估了MiA-RAG,用于基于证据的理解和全局意义构建。它持续超越基线,进一步分析表明,它将局部细节与连贯的全局表示对齐,实现了更类人的长上下文检索和推理。

English abstract

Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first framework to formulate mindscape-aware retrieval and generation as a unified conditioning paradigm for LLM-based RAG. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

2512.15374 2026-05-29 cs.AI

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

SCOPE: 通过提示进化增强智能体效能

Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

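AI summary Static prompts leave LLM agents unable to manage dynamic context effectively, which causes failures. The proposed SCOPE framework models context management as an online optimization problem and automatically evolves prompts through a dual-stream memory mechanism and perspective-driven exploration, raising task success on the HLE benchmark from 14.23% to 38.64%.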

AI Chinese abstract

大型语言模型(LLM)智能体越来越多地部署在生成大规模动态上下文的环境中。然而,一个关键瓶颈仍然存在:虽然智能体可以访问这些上下文,但其静态提示缺乏有效管理上下文的机制,导致反复出现纠正性和增强性失败。为了解决这一能力差距,我们引入了通过提示进化实现自进化上下文优化(SCOPE)。SCOPE将上下文管理视为一个在线优化问题,从执行轨迹中综合指导方针,自动进化智能体的提示。我们提出了一种双流机制,在战术记忆(即时错误纠正)和战略记忆之间路由指导方针,后者通过冲突解决、包含剪枝和合并不断优化。为了最大化策略覆盖范围,视角驱动探索进化多个并行提示,由不同的优化视角引导。在HLE基准上的实验表明,SCOPE在没有人工干预的情况下将任务成功率从14.23%提高到38.64%。我们在https://github.com/JarvisPei/SCOPE公开了代码。

English abstract

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce Self-evolving Context Optimization via Prompt Evolution (SCOPE). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent's prompt. We propose a Dual-Stream mechanism that routes guidelines between tactical memory (immediate error correction) and strategic memory, which is continuously refined through conflict resolution, subsumption pruning, and consolidation. To maximize strategy coverage, Perspective-Driven Exploration evolves multiple parallel prompts guided by distinct optimization perspectives. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.

2512.03010 2026-05-29 cs.CV cs.GR cs.RO

SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

SurfFill: 通过高斯曲面元填充完成LiDAR点云

Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

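AI summary LiDAR point clouds often miss thin structures and edge details. SurfFill, a completion scheme based on Gaussian surfels, uses a beam-divergence heuristic to identify ambiguous regions and optimizes the surfel reconstruction there to grow new points, outperforming previous methods on synthetic and real-world scenes.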

Comments Project page: https://lfranke.github.io/surffill

AI Chinese abstract

LiDAR捕获的点云通常被视为主动3D重建的金标准。尽管其在平坦区域精度极高,但捕获容易遗漏小的几何结构,并可能在暗色、吸光材料上失败。或者,拍摄场景的多张照片并应用3D摄影测量可以推断这些细节,因为它们通常代表特征丰富的区域。然而,对于无特征区域,LiDAR的精度很少能达到。因此,我们建议通过引入SurfFill:一种基于高斯曲面元的LiDAR补全方案,结合LiDAR和基于相机的捕获的优势。我们分析LiDAR捕获,并将LiDAR光束发散归因于伪影的主要因素,主要表现为薄结构和边缘。我们利用这一见解,通过评估点云中密度的变化,引入一种用于完成扫描的模糊启发式方法。这使我们能够识别靠近缺失区域的点,然后我们可以使用这些点生长额外的点以完成扫描。对于这种点生长,我们约束高斯曲面元重建,将优化和密集化集中在这些模糊区域。最后,提取模糊区域重建的高斯基元并采样以获取点来完成点云。为了解决大规模重建的挑战,我们将此流程扩展为一种分治方案,用于建筑大小的点云补全。我们在合成和真实场景的LiDAR点云补全任务上评估,发现我们的方法优于先前的重建方法。

English abstract

LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

2512.01334 2026-05-29 cs.CV

AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

AlignVid: 文本引导图像到视频生成中语义保真度的免训练注意力缩放

Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Jingjin Zhu, Manyuan Zhang, Ser-Nam Lim, Harry Yang

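AI summary Visual dominance causes semantic edits to fail in text-guided image-to-video generation. The paper proposes training-free Attention Scaling Modulation (ASM) and Guidance Scheduling (GS) and builds the OmitI2V benchmark, improving semantic fidelity with negligible computational overhead.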

AI Chinese abstract

文本引导的图像到视频生成取得了显著进展,但在执行需要对参考图像进行实质性更改(例如,添加、移除或修改对象)的文本指定编辑时仍存在困难。经验上,我们的分析表明,这源于视觉主导,即参考图像导致严重的注意力分散,抑制了模型整合新语义信息的能力。为解决此问题,我们提出AlignVid,一种免训练干预方法,重新校准模型内部的注意力分布。基于注意力的能量视角,AlignVid采用注意力缩放调制(ASM)以降低注意力熵并将焦点集中在语义标记上,同时结合引导调度(GS)以保持生成稳定性。为严格评估此能力,我们提出OmitI2V,一个全面的基准,用于评估对象修改、添加和删除中的提示遵循度。大量实验表明,AlignVid有效增强了语义保真度,且计算开销可忽略。代码和OmitI2V基准可在https://github.com/LAW1223/AlignVid获取。

English abstract

Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (e.g., object addition, removal, or modification). Empirically, our analysis reveals that this stems from visual dominance, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose AlignVid, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (ASM) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (GS) to maintain generation stability. To rigorously assess this capability, we present OmitI2V, a comprehensive benchmark for evaluating prompt adherence across object modification, addition, and deletion. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead. Code and the OmitI2V benchmark are available at https://github.com/LAW1223/AlignVid.
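
To illustrate why scaling attention can reduce attention entropy, the sketch below multiplies a row of attention logits by a factor `gamma` before the softmax and compares entropies. This is a generic inverse-temperature scaling chosen for illustration; the abstract does not specify the ASM formula, so the function names, the example logits, and the value of `gamma` are all assumptions rather than the AlignVid method itself.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def scaled_attention(logits: List[float], gamma: float) -> List[float]:
    """Attention weights after scaling the logits by gamma (hypothetical knob)."""
    return softmax([gamma * x for x in logits])


logits = [2.0, 1.0, 0.5, 0.1]             # made-up query-key scores
base = scaled_attention(logits, 1.0)      # unscaled attention
sharp = scaled_attention(logits, 2.0)     # scaled attention, gamma > 1
print(entropy(base), entropy(sharp))
```

With `gamma` above 1 the distribution concentrates on the highest-scoring token, so its entropy falls; how AlignVid chooses which tokens and layers to scale is not covered by this sketch.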

2512.00283 2026-05-29 cs.LG cs.AI q-bio.QM

BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

BioArc:发现生物学基础模型的最优神经架构

Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

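AI summary Existing foundation model architectures are transferred to biology without accounting for the distinctive properties of biological data. The BioArc framework uses neural architecture search to explore the architecture design space systematically, finds high-performance architectures and distills design principles, and proposes architecture prediction methods that efficiently predict optimal architectures for new tasks.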

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI Chinese abstract

基础模型已彻底改变了自然语言处理(NLP)和计算机视觉(CV)等多个领域。尽管已有努力将通用AI领域中基础模型的成功迁移到生物学,但现有工作主要直接采用来自通用机器学习领域的现有基础模型架构,而未考虑每种生物数据模态独特的物理化学和结构特性进行系统设计。这导致性能欠佳,因为这些改造后的架构难以捕捉生物数据固有的长程依赖、稀疏信息和复杂的底层“语法”。为解决这一差距,我们引入了BioArc,这是一个新颖的框架,旨在超越直觉驱动的架构设计,转向生物学基础模型的原理性、自动化架构发现。利用神经架构搜索(NAS),BioArc系统性地探索了广阔的架构设计空间,跨多种生物模态评估架构,同时严格分析架构、分词和训练策略之间的相互作用。这一大规模分析识别出新颖的高性能架构,使我们能够提炼出一套经验性设计原则,以指导未来的模型开发。此外,为充分利用这套发现的原理性架构,我们提出并比较了几种架构预测方法,这些方法能够有效且高效地预测新生物学任务的最优架构。总体而言,我们的工作为基础资源和原理性方法论提供了基础,以指导下一代生物学任务特定模型和基础模型的创建。

English abstract

Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.

2511.22884 2026-05-29 cs.AI

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

InsightEval:用于评估LLM驱动数据代理洞察发现能力的专家策划基准

Zhenghao Zhu, Yuanfeng Song, Xin Chen, Chengzhong Liu, Yakun Cui, Caleb Chen Cao, Sirui Han, Yike Guo

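AI summary Addresses flaws in the existing insight discovery benchmark InsightBench by proposing a data-curation pipeline that builds a new benchmark, InsightEval, and introduces a new metric for agents' exploratory performance.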

AI Chinese abstract

数据分析已成为科学研究不可或缺的一部分。为了发现隐藏在大量数据集中的潜在知识和洞察,我们需要进行深入的探索性分析以实现其全部价值。随着大语言模型(LLM)和多智能体系统的出现,越来越多的研究人员利用这些技术进行洞察发现。然而,评估洞察发现能力的基准很少。作为现有最全面的框架之一,InsightBench也存在许多关键缺陷:格式不一致、目标设计不当以及冗余洞察。这些问题可能严重影响数据质量和代理评估。为了解决这些问题,我们深入研究了InsightBench的缺点,并提出了高质量洞察基准的基本标准。据此,我们开发了一个数据整理流程,构建了一个名为InsightEval的新数据集。我们进一步引入了一种新的度量标准来衡量代理的探索性能。通过在InsightEval上的大量实验,我们突出了自动化洞察发现中的普遍挑战,并提出了一些关键发现以指导这一有前景方向的未来研究。

English abstract

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis to realize their full value. With the advent of large language models (LLMs) and multi-agent systems, more and more researchers are making use of these technologies for insight discovery. However, there are few benchmarks for evaluating insight discovery capabilities. As one of the most comprehensive existing frameworks, InsightBench also suffers from many critical flaws: format inconsistencies, poorly conceived objectives, and redundant insights. These issues may significantly affect the quality of data and the evaluation of agents. To address these issues, we thoroughly investigate shortcomings in InsightBench and propose essential criteria for a high-quality insight benchmark. Regarding this, we develop a data-curation pipeline to construct a new dataset named InsightEval. We further introduce a novel metric to measure the exploratory performance of agents. Through extensive experiments on InsightEval, we highlight prevailing challenges in automated insight discovery and raise some key findings to guide future research in this promising direction.

2511.14584 2026-05-29 cs.LG cs.AI

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

ReflexGrad: 基于进度门控双过程路由的LLM智能体片段内故障恢复

Ankush Kadu, Aswanth Krishnan

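AI summary Proposes ReflexGrad, a dual-process architecture that uses progress-gated routing for within-episode failure recovery in LLM agents without demonstrations, substantially raising task success rates.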

Comments 18 pages, 4 figures, 10 tables. Accepted at ICML 2026 FoGen Workshop

AI Chinese abstract

我们提出ReflexGrad,一种用于LLM智能体在无演示条件下进行片段内故障恢复的双过程架构。当智能体过早采用错误方法并耗尽步骤预算时,故障后的轨迹包含逃脱所需的信息——但尚无已发表的架构在单个片段内利用这一信息。ReflexGrad在快速过程(每k=3步进行TextGrad风格的连续优化)和慢速过程(当m=5个连续低进度分数触发路由门时进行Reflexion风格的因果诊断)之间进行路由。确定性优先级合并保持自然语言策略的一致性,每次慢速激活产生三个可观察的产物:可复现的触发器、因果诊断和验证后的修复。在ALFWorld 134个任务、n=10个种子、无演示条件下,ReflexGrad将Qwen-3-8B从35.1%提升至75.4%(+40.3个百分点),超过计算匹配的1-shot LATS 2.7个百分点(p≈0.01)、ToT 5.7个百分点(p<10⁻⁴)和Self-Refine 6.7个百分点(p<10⁻⁵);在GPT-5上提升从46.3%至88.1%(+41.8个百分点)。1.5个百分点的跨模型差异在种子噪声范围内(p≈0.13),表明路由机制而非模型规模是增益的主要来源。代码、提示、逐种子日志和敏感性扫描已发布。

English abstract

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents commit to a wrong approach early and exhaust the step budget, the post-failure trajectory contains the information to escape -- but no published architecture acts on it within a single episode. ReflexGrad routes between a fast process (TextGrad-style continuous refinement every k=3 steps) and a slow process (Reflexion-style causal diagnosis when m=5 consecutive low-progress scores fire a routing gate). A deterministic priority merge keeps the natural-language policy coherent, and each slow activation emits three observable artifacts: a reproducible trigger, a causal diagnostic, and a verified fix. On ALFWorld 134 tasks, n=10 seeds, no demonstrations, ReflexGrad lifts Qwen-3-8B from 35.1% to 75.4% (+40.3 pp), beating compute-matched 1-shot LATS by +2.7 pp (p≈0.01), ToT by +5.7 pp (p<10⁻⁴), and Self-Refine by +6.7 pp (p<10⁻⁵); on GPT-5 the lift is 46.3→88.1% (+41.8 pp). The 1.5 pp cross-model difference is within seed noise (p≈0.13), suggesting that the routing mechanism, rather than model scale, is the primary source of the gain. Code, prompts, per-seed logs, and sensitivity sweeps are released.
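
A minimal sketch of the progress-gated routing described in the abstract above. The values k=3 and m=5 come from the abstract; the low-progress threshold, the rule that the slow process takes priority over a scheduled fast step, and the streak reset after a slow activation are assumptions made for illustration and may differ from the released code.

```python
from typing import List

K_FAST = 3          # from the abstract: fast refinement every k=3 steps
M_SLOW = 5          # from the abstract: m=5 consecutive low-progress scores
LOW_PROGRESS = 0.2  # hypothetical threshold for a "low" progress score


def route(progress_scores: List[float]) -> List[str]:
    """Label each step 'act', 'fast', or 'slow' from its progress score."""
    actions = []
    low_streak = 0
    for step, score in enumerate(progress_scores, start=1):
        # Count consecutive low-progress steps; any good step resets the count.
        low_streak = low_streak + 1 if score < LOW_PROGRESS else 0
        if low_streak >= M_SLOW:
            actions.append("slow")  # gate fires: run the causal diagnosis
            low_streak = 0          # assumption: start a fresh streak afterwards
        elif step % K_FAST == 0:
            actions.append("fast")  # periodic lightweight refinement
        else:
            actions.append("act")   # ordinary environment step
    return actions


scores = [0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.6]
print(route(scores))
# ['act', 'act', 'fast', 'act', 'act', 'fast', 'act', 'slow', 'fast']
```

In this example the gate fires at step 8, the fifth consecutive low-progress step, while the fast process keeps running on its fixed schedule at steps 3, 6, and 9.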

2511.11703 2026-05-29 cs.LG cs.AI cs.RO

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

通过语义分割增强3D环境中的强化学习:以ViZDoom为例

Jin Huang

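AI summary Proposes two input representations, SS-only and RGB+SS, that use semantic segmentation to reduce memory consumption and improve reinforcement learning performance in 3D environments, validated in ViZDoom.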

Comments Master's Thesis at the University of Edinburgh (2024)

AI Chinese abstract

在高维感官输入的3D环境中,强化学习面临两大挑战:(1) 稳定学习所需的内存缓冲区导致的高内存消耗,以及(2) 部分可观测马尔可夫决策过程(POMDPs)的复杂性。本项目通过提出两种新颖的输入表示:SS-only和RGB+SS,两者均对RGB彩色图像进行语义分割,以应对这些挑战。在ViZDoom的死斗模式中进行了实验,利用完美的分割结果进行受控评估。我们的结果表明,SS-only能够将内存缓冲区的内存消耗减少至少66.6%,当应用如游程编码等最小开销的可向量化无损压缩技术时,可减少高达98.6%。同时,RGB+SS通过提供的额外语义信息显著增强了强化学习代理的性能。此外,我们探索了基于密度的热力图作为可视化强化学习代理移动模式的工具,并评估了其用于数据收集的适用性。与先前方法的简要比较突出了我们的方法如何克服在ViZDoom等3D环境中应用语义分割时的常见陷阱。

English abstract

Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents' performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents' movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
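
A small sketch of where the memory savings in the abstract above could come from. It assumes that an SS-only frame stores one class label per pixel instead of three RGB channels, which would account for a saving of about two thirds, and it shows run-length encoding on one row of labels. The helper names and the example row are hypothetical, and this plain-Python loop is not the vectorised implementation the abstract refers to.

```python
from typing import List, Tuple


def rle_encode(labels: List[int]) -> List[Tuple[int, int]]:
    """Losslessly compress a row of class labels into (label, run_length) pairs."""
    runs: List[List[int]] = []
    for value in labels:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Invert rle_encode exactly."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


# Assumed channel saving: 3 RGB channels replaced by 1 label channel.
channel_saving = 1 - 1 / 3

# One made-up row of a segmentation mask: background, an object, background.
row = [0] * 10 + [3] * 4 + [0] * 6
encoded = rle_encode(row)
print(f"channel saving: {channel_saving:.3f}")
print(encoded)
```

Segmentation masks tend to contain long runs of identical labels, which is why run-length encoding can shrink them much further than raw RGB frames; the 98.6% figure in the abstract depends on the actual frames and is not reproduced here.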

2511.11505 2026-05-29 cs.LG

FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

Yonatan Dukler, Guihong Li, Deval Shah, Jiang Liu, Vikram Appia, Emad Barsoum

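AI summary Proposes FarSkip-Collective, which modifies the skip connections of the model architecture to overlap computation with communication, matching the accuracy of the original releases on state-of-the-art models from 16B to 109B parameters while significantly accelerating inference and training.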

Comments MLSys'26

Details
Abstract

Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy that is comparable with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction-tuned release over a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate a 32.6% speedup in Time To First Token when serving a converted DeepSeek-V3 architecture with expert parallelism in SGLang and achieve 97.3% communication-computation overlap during the prefill stage. During training, our approach enables 88.9% communication overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.

2511.10861 2026-05-29 cs.CV cs.AI cs.LG

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

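AI summary Addresses the cascading accuracy degradation caused by pruning pre-trained CNNs in data-scarce transfer learning by proposing an accuracy-aware pruning control mechanism that dynamically adjusts the pruning rate and order to suppress accuracy loss and improve classification performance after compression.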

Comments Accepted to Scientific Reports. The title was revised during the peer review process

Details
Abstract

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15% over conventional LRP-based approaches.
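
A short sketch may help show why a harmonic mean of per-class accuracy is a sensible control signal: it drops sharply when a single class collapses, while the arithmetic mean hides the failure. The abstract does not specify the control rule, so `next_pruning_rate`, its `floor` and `shrink` parameters, and the example accuracies below are hypothetical.

```python
from statistics import fmean, harmonic_mean


def next_pruning_rate(rate, class_accuracies, floor=0.8, shrink=0.5):
    """Hypothetical rule: shrink the pruning rate when the harmonic mean
    of per-class accuracy falls below an assumed tolerance `floor`."""
    if harmonic_mean(class_accuracies) < floor:
        return rate * shrink
    return rate


# Two healthy classes and one class that has collapsed after pruning.
accs = [0.9, 0.9, 0.3]
arithmetic = fmean(accs)
harmonic = harmonic_mean(accs)
print(f"arithmetic mean = {arithmetic:.2f}, harmonic mean = {harmonic:.2f}")
print("next rate:", next_pruning_rate(0.1, accs))
```

In this toy case the arithmetic mean stays at 0.70 while the harmonic mean falls to 0.54, so a controller keyed to the harmonic mean reacts earlier to a single degrading class.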

2511.04934 2026-05-29 cs.LG

Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, Mingyi Hong

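AI summary Finds that existing LLM unlearning methods fail to truly remove sensitive information under probabilistic decoding, and proposes a new metric, leak@$k$, and an algorithm, RULE, to evaluate and mitigate knowledge leakage.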

Details
Abstract

Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these 'unlearned' models under deterministic (greedy) decoding on standard benchmarks often suggest successful knowledge removal, we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using the leak@$k$ metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art (SOTA) unlearning techniques provide only limited forgetting. We propose an algorithm, termed Robust Unlearning under LEak@$k$ metric (RULE), to address this concern. We demonstrate that RULE provides an unlearned model for the TOFU benchmark with no information leakage for a large number of generation samples. On the MUSE benchmark, RULE outperforms SOTA unlearning methods under the leak@$k$ metric across most sampling budgets $k$. Code is available at https://github.com/OptimAI-Lab/Leak-k.
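
The gap between greedy and sampled evaluation can be illustrated with a small estimator. The sketch below is not the paper's definition of leak@$k$, which the abstract does not spell out; it assumes a pass@k-style reading in which $n$ generations are drawn, $c$ of them are judged to leak, and the quantity of interest is the chance that at least one of $k$ samples drawn without replacement leaks. The function name `leak_at_k_estimate` is hypothetical.

```python
from math import comb


def leak_at_k_estimate(n, c, k):
    """Assumed pass@k-style estimate: probability that at least one of k
    samples, drawn without replacement from n generations of which c leak,
    contains leaked content."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate saturates at 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# A model that leaks in only 2 of 100 sampled generations.
for k in (1, 10, 50):
    print(k, round(leak_at_k_estimate(100, 2, k), 4))
```

Under this reading, a leak rate that looks negligible for a single generation grows quickly with the sampling budget, which is consistent with the abstract's point that single-output evaluation can overstate forgetting.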

2510.27607 2026-05-29 cs.CV cs.RO

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

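AI summary Proposes the DUST framework, which uses a dual-stream diffusion Transformer and an asynchronous sampling method to address the modality gap in world-model augmented vision-language-action models, achieving notable gains on simulated and real-world tasks.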

Comments Accepted at ICML 2026. Project page at https://periphanes.github.io/dust (20 pages, 10 figures)

Details
Abstract

Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.

2510.26623 2026-05-29 cs.RO

A Sliding-Window Filter for Online Continuous-Time Continuum Robot State Estimation

Spencer Teetaert, Sven Lilge, Jessica Burgner-Kahrs, Timothy D. Barfoot

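AI summary Proposes a stochastic sliding-window filter designed specifically for continuum robots that uses a continuous-time formulation to improve filtering accuracy and enable online operation while running faster than real time.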

Comments 8 pages, 6 figures. Submitted to IEEE-RAS International Conference on Soft Robotics 2026

Details
Journal ref
2026 IEEE 9th International Conference on Soft Robotics (RoboSoft), 239-246
Abstract

Stochastic state estimation methods for continuum robots (CRs) often struggle to balance accuracy and computational efficiency. While several recent works have explored sliding-window formulations for CRs, these methods are limited to simplified, discrete-time approximations and do not provide stochastic representations. In contrast, current stochastic filter methods must run at the speed of measurements, limiting their full potential. Recent works in continuous-time estimation techniques for CRs show a principled approach to addressing this runtime constraint, but are currently restricted to offline operation. In this work, we present a sliding-window filter (SWF) for continuous-time state estimation of CRs that improves upon the accuracy of a filter approach while enabling continuous-time methods to operate online, all while running at faster-than-real-time speeds. This represents the first stochastic SWF specifically designed for CRs, providing a promising direction for future research in this area.

2510.26412 2026-05-29 cs.CV cs.AI

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

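AI summary To address the challenge of evaluating long video generation under complex text inputs, proposes the LoCoT2V-Bench benchmark with multi-scene prompts and hierarchical metadata, together with the multi-dimensional evaluation framework LoCoT2V-Eval; experiments find that models fall markedly short on fine-grained text-video alignment and character consistency.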

Comments Accepted by ICML 2026 (Regular)

Details
Abstract

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench

2510.24606 2026-05-29 cs.CL

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

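AI summary Proposes Dynamic Hierarchical Sparse Attention (DHSA), which predicts attention sparsity online while keeping the LLM backbone frozen, enabling efficient long-context inference under memory constraints with near-dense accuracy and substantial speedups.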

Comments ICML26 (Spotlight)

Details
Abstract

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent variations, and recent dynamic approaches rely on predefined templates or heuristics that may sacrifice generality. We propose Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online while keeping the LLM backbone frozen. DHSA performs hierarchical routing by estimating importance at the chunk level and propagating it to token-level interactions, preserving causally important dependencies while enabling efficient sparsification. Across the Needle-in-a-Haystack test, LongBench, and RULER, DHSA maintains near-dense accuracy in highly sparse regimes, achieving 12-20% relative accuracy gains over Block Sparse Attention at comparable prefill cost. With a memory-efficient tiled backend, DHSA delivers up to $10\times$ prefill speedup at 128K context length. On LLaMA-3.1-8B (4-bit), DHSA scales to 100K context on a single 24GB GPU, where dense attention fails. We provide complementary GPU and CPU backends, enabling DHSA to run across diverse hardware environments and multiple open-weight model families. These results demonstrate DHSA as an efficient and adaptable solution for memory-constrained long-context LLM inference.

2510.16658 2026-05-29 cs.AI cs.CE

Large-Scale AI and Foundation Models for Neuroscience: A Comprehensive Review

Shihao Yang, Xiying Huang, Danilo Bernardo, Jun-En Ding, Andrew Michael, Guoan Wang, Jingmei Yang, Alison Anderson, Dinesh Giritharan, Patrick Kwan, Ashish Raj, Yu Zhang, Feng Liu

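AI summary Reviews applications of large-scale AI models in four major areas of neuroscience (neuroimaging and data processing, brain-computer interfaces and neural decoding, clinical decision support and translational frameworks, and disease-specific applications for neurological and psychiatric disorders), showing their potential for multimodal data integration, spatiotemporal pattern interpretation, and clinical translation, and stressing the importance of rigorous evaluation, domain-knowledge integration, clinical validation, and ethical guidelines.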

Comments Accepted for publication in Meta-Radiology

Details
Abstract

The development of large-scale artificial intelligence (AI) models is influencing neuroscience research by enabling end-to-end learning from raw brain signals and neural data. In this paper, we review applications of large-scale AI models across four major neuroscience domains: neuroimaging and data processing, brain-computer interfaces and neural decoding, clinical decision support and translational frameworks, and disease-specific applications across neurological and psychiatric disorders. These models show potential to address major computational neuroscience challenges, including multimodal neural data integration, spatiotemporal pattern interpretation, and the development of translational frameworks for clinical research. Moreover, the interaction between neuroscience and AI has become increasingly reciprocal, as biologically informed architectural constraints are now incorporated to develop more interpretable and computationally efficient models. This review highlights both the promise of such technologies and critical implementation considerations, with particular emphasis on rigorous evaluation frameworks, effective integration of domain knowledge, prospective clinical validation, and comprehensive ethical guidelines. Finally, a systematic listing of critical neuroscience datasets used to develop and evaluate large-scale AI models across diverse research applications is provided.

2510.16060 2026-05-29 cs.LG cs.AI stat.ME stat.ML

Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

Coen Adler, Yuxin Chang, Felix Draxler, Samar Abdi, Padhraic Smyth

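AI summary Systematically evaluates the calibration properties of five time series foundation models and two baselines, finding that the foundation models are better calibrated than the baselines and are not systematically over- or under-confident.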

Comments Published as a conference paper at ICLR 2026

Details
Journal ref
Proceedings of ICLR 2026
Abstract

The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
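
One common way to check the over- or under-confidence mentioned above is to compare the empirical coverage of a forecast's prediction intervals with their nominal level. The sketch below illustrates that general idea only; it is not the paper's evaluation protocol, and the function names and toy numbers are assumptions.

```python
def empirical_coverage(lowers, uppers, targets):
    """Fraction of observed targets that fall inside their prediction interval."""
    if not (len(lowers) == len(uppers) == len(targets)) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for lo, hi, y in zip(lowers, uppers, targets) if lo <= y <= hi)
    return hits / len(targets)


def calibration_gap(coverage, nominal):
    """Negative gap: intervals too narrow (over-confident).
    Positive gap: intervals too wide (under-confident)."""
    return coverage - nominal


# Toy 90% intervals for four forecast steps; one target falls outside.
lowers = [0.0, 0.0, 0.0, 0.0]
uppers = [1.0, 1.0, 1.0, 1.0]
targets = [0.5, 0.2, 1.5, 0.9]

coverage = empirical_coverage(lowers, uppers, targets)
print(coverage, calibration_gap(coverage, nominal=0.9))
```

A well-calibrated forecaster would show a gap near zero across several nominal levels, not just one.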

2510.14365 2026-05-29 cs.CL

Understanding the Ability of LLMs to Handle Character-Level Perturbation

Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang, Pinyan Lu

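AI summary Tests the robustness of large language models under three character-level perturbations (numerous typos within words, shuffled characters, and inserted invisible characters), finds that models retain notable performance even under severe perturbation, and explores the underlying mechanisms.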

Comments Accepted by ICML 2026

Details
Abstract

This work investigates the resilience of contemporary large language models (LLMs) against frequent character-level perturbations. We examine three types of character-level perturbations: introducing numerous typos within words, shuffling the characters in each word, and inserting a large number of invisible characters into the text. Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting several times as many invisible characters as visible ones as noise, many LLMs still maintain notable performance. We explore the underlying causes of this robustness and find that LLMs exhibit remarkable resilience to chaotic segmentation and fragmented tokenization. Furthermore, we examine the mechanisms by which LLMs remove perturbations to correctly comprehend text, including both implicit and explicit mechanisms for character-level perturbation. We hope that our findings on the low-level robustness of LLMs will unveil their inherent architectural strengths, reveal the potential risks of their misuse, and inform the reliable deployment of LLMs across diverse application scenarios.
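
Two of the perturbations described above are easy to reproduce in a few lines. The sketch below is an illustration under stated assumptions and not the authors' code: the function names are hypothetical, every character of a word is shuffled (the paper may constrain this differently), and the zero-width space U+200B is an assumed choice of invisible character.

```python
import random

ZERO_WIDTH_SPACE = "\u200b"  # assumed invisible character


def shuffle_word_chars(text, rng):
    """Shuffle the characters inside each space-separated word."""
    shuffled = []
    for word in text.split(" "):
        chars = list(word)
        rng.shuffle(chars)
        shuffled.append("".join(chars))
    return " ".join(shuffled)


def insert_invisible(text, per_char, invisible=ZERO_WIDTH_SPACE):
    """Append `per_char` invisible characters after every visible character."""
    return "".join(ch + invisible * per_char for ch in text)


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
sentence = "language models tolerate surprising noise"
print(shuffle_word_chars(sentence, rng))

noisy = insert_invisible(sentence, per_char=3)
print(len(sentence), "visible characters,", len(noisy), "characters after insertion")
```

With `per_char=3` the noisy string carries three invisible characters for every visible one, which matches the "several times more" regime the abstract describes.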

2510.10961 2026-05-29 cs.CL cs.AI

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

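AI summary Presents the KOTOX dataset, which defines linguistically grounded Korean obfuscation rules and a transformation framework to support deobfuscation and detoxification of obfuscated toxic text, and is the first to support both for Korean.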

Comments 26 pages, 12 figures, 24 tables

Details
Abstract

As language models become increasingly deployed in online environments, toxicity detection and detoxification have received growing attention. Existing studies primarily focus on non-obfuscated text, which limits robustness when users intentionally disguise toxic expressions. In particular, Korean toxic expressions can be easily disguised through agglutinative morphology and Hangeul-specific orthographic variation. However, obfuscation in Korean remains largely unexplored, which motivates us to introduce KOTOX, a Korean toxic dataset for deobfuscation and detoxification. We categorize Korean obfuscation patterns into linguistically grounded classes, define transformation rules derived from real-world examples, and provide the resulting obfuscation framework as an open transformation package. Using these rules, we provide paired neutral and toxic sentences alongside their obfuscated counterparts. Models trained on our dataset better handle obfuscated text without sacrificing performance on non-obfuscated text. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect the dataset to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for Korean. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

2510.06182 2026-05-29 cs.CL

Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Yoav Gur-Arieh, Mor Geva, Atticus Geiger

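AI summary Studies three mechanisms by which language models bind and retrieve entities in context (positional, lexical, and reflexive), reveals how they are mixed through experiments on nine models and ten binding tasks, and builds a causal model that estimates next-token distributions with 95% agreement.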

Comments Accepted to ICLR 2026 Main Conference

Details
Abstract

A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked "Who loves pie?" Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where "Ann" is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism (retrieving "Ann" through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.

2510.06063 2026-05-29 cs.AI cs.IT cs.LG math.IT

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying

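AI summary Presents TelecomTS, a large-scale multi-modal observability dataset from a 5G telecommunications network; with heterogeneous covariates that preserve absolute scale and diverse downstream tasks (anomaly detection, root cause analysis, multi-modal question answering), it exposes the shortcomings of existing models on the noisy, abrupt dynamics of observability data.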

Details
Abstract

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

2510.04758 2026-05-29 cs.LG

Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors

Zhiwei Han, Stefan Matthes, Hao Shen

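AI summary By transporting the analysis from the observation space to the source space and using orthogonal polynomial expansions of bivariate distributions, proves that nonlinear canonical correlation analysis (CCA) recovers the ground-truth latent factors up to an affine transformation under specific distributional priors, and shows that ridge-regularized empirical CCA converges to its population counterpart in the finite-sample regime.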

Details
Abstract

In this work, we establish sufficient conditions under which nonlinear Canonical Correlation Analysis (CCA) recovers ground-truth latent factors up to an affine transformation. By transporting the analysis from the observation space to the source space, we extend classical statistical results on orthogonal polynomial expansions of bivariate distributions to representation learning, proving affine identifiability under specific distributional priors. We formally demonstrate that whitening is strictly necessary to ensure the boundedness and well-conditioning of the learned mappings. Furthermore, we bridge the gap between theory and practice by proving that ridge-regularized empirical CCA converges to its population counterpart in the finite-sample regime. Finally, our findings provide a rigorous theoretical foundation explaining the empirical success of recent correlation-based non-contrastive learning methods. Experiments on synthetic and rendered image datasets, alongside systematic ablations, validate the predicted recovery behavior and illustrate the failure modes that arise when the assumptions are violated.

2510.03013 2026-05-29 cs.LG

Distributional Inverse Reinforcement Learning

Feiyang Wu, Ye Zhao, Anqi Wu

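AI summary Proposes a distributional framework for offline inverse reinforcement learning that minimizes first-order stochastic dominance violations and integrates distortion risk measures to jointly model uncertainty over reward functions and the full distribution of returns, recovering both reward distributions and distribution-aware policies.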

Comments ICML 2026 Oral

Details
Abstract

We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Theoretical analysis shows that the algorithm converges with $\mathcal{O}(\varepsilon^{-2})$ iteration complexity. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art performance.
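
For readers unfamiliar with first-order stochastic dominance (FSD): a return distribution $X$ dominates $Y$ when $F_X(t) \le F_Y(t)$ for every $t$, which for equal-sized samples means each sorted sample of $X$ is at least the matching sorted sample of $Y$. The sketch below computes one simple empirical violation measure based on that quantile comparison. It is an illustration of the concept, not the paper's objective; the function name `fsd_violation` and the choice of averaging the shortfalls are assumptions.

```python
def fsd_violation(x_samples, y_samples):
    """Average amount by which X's sorted samples fall below Y's.
    Returns 0.0 when X first-order dominates Y on these samples."""
    if len(x_samples) != len(y_samples) or len(x_samples) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(x_samples), sorted(y_samples)
    # Compare matching empirical quantiles; only shortfalls of X count.
    return sum(max(y - x, 0.0) for x, y in zip(xs, ys)) / len(xs)


returns_a = [2.0, 3.0, 4.0]   # every quantile is at least that of returns_b
returns_b = [1.0, 2.0, 3.0]

print(fsd_violation(returns_a, returns_b))  # a dominates b
print(fsd_violation(returns_b, returns_a))  # b falls short at every quantile
```

A measure of this kind is zero only when dominance holds at every quantile, so minimizing it constrains the whole return distribution and not just its mean.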