arXivDaily: daily arXiv academic digest, updated Monday to Friday
2506.12876 2026-05-14 cs.LG

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao

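Affiliations School of Computer Science, Faculty of Engineering, The University of Sydney; Generative AI Lab, College of Computing and Data Science, Nanyang Technological University; University of Science and Technology of China; RMIT University; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI summary As large language models (LLMs) scale rapidly, inference efficiency has become a primary bottleneck in practical deployment. To address this, the work proposes MaskPro, a linear-space probabilistic learning framework that learns a prior distribution over every M consecutive weights and uses it to generate strict (N:M) sparsity, reducing compute and memory costs while remaining hardware-friendly. By introducing a moving-average tracker of loss residuals, the method mitigates the training instability caused by high-variance policy gradients in the combinatorial space, and its theoretical analysis and experiments show superior performance, memory efficiency, and robustness to data samples.

Abstract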

The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the very large combinatorial space, we propose a novel update method that introduces a moving average tracker of loss residuals in place of the vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
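To make the sampling step concrete, here is a minimal sketch of drawing a strict (N:M) mask from per-group categorical logits by sampling without replacement. It is an illustration only, not MaskPro's implementation: the function names are hypothetical, the Gumbel top-k trick is one standard way to sample without replacement from a softmax distribution, and the policy-gradient training of the logits is not shown.

```python
import math
import random


def gumbel_topn_indices(logits, n, rng):
    """Sample n distinct indices without replacement from softmax(logits).

    Uses the Gumbel top-k trick: add Gumbel noise to every logit and keep
    the n largest perturbed values.
    """
    keys = []
    for i, logit in enumerate(logits):
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        keys.append((logit - math.log(-math.log(u)), i))
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:n])


def nm_mask(group_logits, n, rng):
    """Build a flat 0/1 mask with exactly n ones in every group of M entries.

    group_logits holds one list of M logits per group, so storage grows
    linearly with the number of weights.
    """
    mask = []
    for logits in group_logits:
        keep = set(gumbel_topn_indices(logits, n, rng))
        mask.extend(1 if i in keep else 0 for i in range(len(logits)))
    return mask


def apply_mask(weights, mask):
    """Zero out the weights whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [[0.0, 1.0, 2.0, 3.0], [5.0, 0.0, 0.0, 5.0]]  # two groups, M = 4
    print(nm_mask(logits, 2, rng))  # each group of 4 keeps exactly 2 entries
```

Every group keeps exactly N entries regardless of the logits, which is what makes the resulting pattern strictly (N:M)-sparse.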

2506.11274 2026-05-14 cs.CL cs.LG

Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

Liran Ringel, Elad Tolochinsky, Yaniv Romano

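Affiliations Department of Computer Science, Technion – Israel Institute of Technology; Department of Electrical and Computer Engineering, Technion – Israel Institute of Technology

AI summary This work studies whether a dedicated "continue-thinking" token can strengthen a language model's extended reasoning at inference time. The authors add a learnable "<|continue-thinking|>" token to a distilled version of DeepSeek-R1 and train only its embedding with reinforcement learning, keeping the model weights frozen. Experiments show larger accuracy gains on standard math benchmarks than both the baseline model and test-time scaling with a fixed token (e.g., "Wait"); on GSM8K, for example, the learned token yields a 4.2% absolute improvement.

Abstract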

Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.

2506.09522 2026-05-14 cs.CV cs.AI cs.CL

Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Beomsik Cho, Jaehyung Kim

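Affiliations Yonsei University

AI summary This work examines how visual information contributes to decoding in large vision-language models (LVLMs) and finds that vision tokens carry meaningful visual information even when hallucinations occur, and that their semantics can be made explicit in the text space. Building on this, it proposes ReVisiT, a training-free decoding method that projects vision tokens into the text token distribution and dynamically selects the most relevant vision token at each decoding step to guide generation, improving how the model incorporates visual semantics. Experiments show that ReVisiT performs strongly across multiple benchmarks while reducing computational cost.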

Comments ACL 2026 Main Conference (Oral). 30 pages, 10 figures. Code: https://github.com/bscho333/ReVisiT

Abstract

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization. Then, ReVisiT uses its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.

2506.04165 2026-05-14 cs.LG cs.DS

A Faster Generalized Two-Stage Approximate Top-K

Yashas Samaga, Varun Yerram, Spandana Raj Babbula, Prateek Jain, Praneeth Netrapalli

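Affiliations University of Washington, Seattle; New York University; Google DeepMind

AI summary This paper studies how to efficiently approximate the selection of the $K$ largest elements of an array (the Top-$K$ problem), a performance bottleneck in many machine learning algorithms. The authors generalize the existing two-stage approximate Top-$K$ algorithm so that each partition contributes more candidates in the first stage, which reduces the input to the second stage more effectively. They derive an expression for the expected recall under random partitioning, show that a larger $K'$ with fewer partitions improves efficiency at the same expected recall, give a tighter bound on the recall of the original algorithm, and implement the algorithm on Cloud TPUv5e with roughly an order-of-magnitude speedup over the original.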

Comments Accepted at TMLR May 2026

Abstract

We consider the Top-$K$ selection problem, which aims to identify the largest $K$ elements in an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, Chern et al. (2022) proposed a fast two-stage approximate Top-$K$ algorithm that: (i) partitions the input array into equal-sized chunks and selects the top-$1$ element from each partition; and (ii) sorts the resulting smaller subset and returns the top $K$ elements. In this paper, we generalize the first stage so that each partition selects the top $K'$ elements (for $1 \leq K' \leq K$). Our contributions include: (i) an expression for the expected recall of this generalized algorithm under random partitioning, and a demonstration that choosing $K' > 1$ with fewer partitions in the first stage more effectively reduces the input size to the second stage while maintaining the same expected recall as the original algorithm; (ii) a bound on the expected recall of the original algorithm as a function of the algorithm parameters that is provably tighter by a factor of $2$ than the bound reported by Chern et al. (2022); and (iii) an implementation of our algorithm on Cloud TPUv5e that achieves approximately an order of magnitude speedup over the original algorithm without sacrificing recall.
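The two stages described above can be sketched in a few lines. This is a plain-Python illustration of the control flow only, not the authors' TPU implementation; the function names, the `rng` argument, and the assumption that the number of partitions divides the array length are all choices made for this sketch.

```python
import heapq
import random


def exact_top_k(values, k):
    """Reference answer: the k largest values, in descending order."""
    return heapq.nlargest(k, values)


def two_stage_top_k(values, k, num_parts, k_prime, rng=None):
    """Approximate top-k via per-partition top-k_prime, then a final top-k.

    Stage 1 keeps the k_prime largest elements of each equal-sized partition.
    Stage 2 selects the top k among the num_parts * k_prime survivors.
    With k_prime = 1 this reduces to the original two-stage scheme.
    """
    items = list(values)
    if rng is not None:
        rng.shuffle(items)  # random partitioning
    size = len(items) // num_parts  # assumes num_parts divides len(items)
    candidates = []
    for p in range(num_parts):
        chunk = items[p * size:(p + 1) * size]
        candidates.extend(heapq.nlargest(k_prime, chunk))
    return heapq.nlargest(k, candidates)


def recall(approx, exact):
    """Fraction of the true top-k that was recovered (assumes distinct values)."""
    return len(set(approx) & set(exact)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(0)
    values = list(range(1000))
    exact = exact_top_k(values, 10)
    # 50 partitions with k_prime = 1 and 25 partitions with k_prime = 2 both
    # pass 50 candidates to the second stage.
    for parts, kp in [(50, 1), (25, 2)]:
        approx = two_stage_top_k(values, 10, parts, kp, rng)
        print(parts, kp, recall(approx, exact))
```

Setting `k_prime` equal to `k` always gives recall 1, since every true top-$K$ element is then among the top $K'$ of its own partition; smaller `k_prime` trades recall for a smaller second-stage input.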

2506.00982 2026-05-14 cs.RO cs.MA

Robust and Safe Multi-Agent Reinforcement Learning with Communication for Autonomous Vehicles: From Simulation to Hardware

Keshawn Smith, Zhili Zhang, H M Sabbir Ahmad, Ehsan Sabouni, Mainak Mondal, Song Han, Wenchao Li, Fei Miao

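Affiliations Department of ECE, University of Connecticut; School of Computing, University of Connecticut; Department of ECE, Boston University

AI summary This paper studies how to achieve robust and safe multi-agent reinforcement learning for autonomous vehicles, particularly under zero-shot transfer from simulation to real hardware. It proposes RSR-RSMARL, a framework that builds state and action representations reflecting real-system complexities and combines a robust reinforcement learning algorithm with a safety module based on control barrier functions to improve safety and coordination in both simulation and hardware. Experiments on miniature autonomous vehicles equipped with vehicle-to-vehicle communication show improved driving safety and cooperation in multi-vehicle scenarios.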

Comments 15 pages, 5 Figures

Abstract

Deep multi-agent reinforcement learning (MARL) has been demonstrated effectively in simulations for multi-robot problems. For autonomous vehicles, the development of vehicle-to-vehicle (V2V) communication technologies provides opportunities to further enhance system safety. However, zero-shot transfer of simulator-trained MARL policies to dynamic hardware systems remains challenging, and how to leverage communication and shared information for MARL has limited demonstrations on hardware. This problem is complicated by discrepancies between simulated and physical states, system state and model uncertainties, practical shared information design, and the need for safety guarantees in both simulation and hardware. This paper designs RSR-RSMARL, a novel Robust and Safe MARL framework that supports Real-Sim-Real (RSR) policy adaptation for multi-agent systems with communication among agents, with both simulation and hardware demonstrations. RSR-RSMARL leverages state representations (including shared state information among agents) and action representations that account for real system complexities in the MARL formulation. The MARL policy is trained with a robust MARL algorithm to enable zero-shot transfer to hardware despite the sim-to-real gap. A safety shield module using Control Barrier Functions (CBFs) provides a safety guarantee for each individual agent. Experimental results on 1/10th-scale autonomous vehicles with V2V communication demonstrate the ability of the RSR-RSMARL framework to enhance driving safety and coordination across multiple configurations. These findings emphasize the importance of jointly designing robust policy representations and modular safety architectures to enable scalable, generalizable RSR transfer in multi-agent autonomy.

2505.22445 2026-05-14 cs.CV cs.AI

NFR: Neural Feature-Guided Non-Rigid Shape Registration

Zhangquan Chen, Puhua Jiang, Mingze Sun, Ruqi Huang

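Affiliations Tsinghua Shenzhen International Graduate School

AI summary This paper proposes a neural feature-guided framework for non-rigid shape registration that handles significant non-rigid deformation and partiality between input shapes without any correspondence annotation. The method feeds neural features extracted by deep shape matching networks into an iterative geometric registration pipeline, which improves the accuracy and semantic meaning of correspondences and gains robustness from dynamic updates and consistency-prior filtering. Experiments show that, even with few training samples, the method reaches state-of-the-art results on several non-rigid point cloud registration and partial shape matching benchmarks and handles complex deformations that traditional methods struggle with.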

Comments 18 pages, 16 figures. arXiv admin note: substantial text overlap with arXiv:2311.04494

Abstract

In this paper, we propose a novel learning-based framework for 3D shape registration, which overcomes the challenges of significant non-rigid deformation and partiality among input shapes and, remarkably, requires no correspondence annotation during training. Our key insight is to incorporate neural features learned by deep learning-based shape matching networks into an iterative, geometric shape registration pipeline. The advantage of our approach is two-fold. On the one hand, neural features provide more accurate and semantically meaningful correspondence estimation than spatial features (e.g., coordinates), which is critical in the presence of large non-rigid deformations. On the other hand, the correspondences are dynamically updated according to the intermediate registrations and filtered by a consistency prior, which markedly robustifies the overall pipeline. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline not only achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching and partial shape matching across varying settings, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work. Our code is available at https://github.com/rqhuang88/NFR.

2505.18604 2026-05-14 cs.LG

Exemplar-Free Continual Learning for State Space Models

Isaac Ning Lee, Leila Mahmoodi, Trung Le, Mehrtash Harandi

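Affiliations Monash University; National University of Singapore

AI summary State-space models (SSMs) excel at capturing long-range dependencies, but their evolving internal states make adaptation under continual learning (CL) difficult, especially in the exemplar-free setting, where the absence of prior data leads to catastrophic forgetting. To address this, the paper proposes Inf-SSM, a geometry-aware regularization method that uses the geometry of the infinite-dimensional Grassmannian to constrain the evolution of SSM states. The regularization requires solving a Sylvester matrix equation, for which the authors design an $\mathcal{O}(n^2)$ solver; experiments on several benchmarks show significantly less forgetting and higher accuracy across sequential tasks.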

Comments Accepted at CVPR 2026

Abstract

State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose challenges in adapting them under Continual Learning (CL). This is particularly difficult in exemplar-free settings, where the absence of prior data leaves updates to the dynamic SSM states unconstrained, resulting in catastrophic forgetting. To address this, we propose Inf-SSM, a novel and simple geometry-aware regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain state evolution during CL. Unlike classical continual learning methods that constrain weight updates, Inf-SSM regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\mathcal{O}(n^3)$ complexity. We develop an $\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks, including ImageNet-R and Caltech-256, demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.
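For reference, the general form of a Sylvester equation is written out below. The symbols $A$, $B$, $C$, and $X$ are generic placeholders; how Inf-SSM instantiates them from SSM parameters is not stated in the abstract and is not assumed here.

```latex
% General Sylvester equation: solve for X given A, B, C.
A X + X B = C, \qquad
A \in \mathbb{R}^{n \times n},\;
B \in \mathbb{R}^{m \times m},\;
X, C \in \mathbb{R}^{n \times m}.

% Equivalent linear system via column-stacking vectorization:
\left( I_m \otimes A + B^{\top} \otimes I_n \right) \operatorname{vec}(X) = \operatorname{vec}(C).
```

A unique solution exists exactly when $A$ and $-B$ share no eigenvalue. General-purpose dense solvers such as the Bartels-Stewart algorithm cost $\mathcal{O}(n^3)$ when $m = n$, consistent with the baseline complexity quoted in the abstract.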

2505.17469 2026-05-14 cs.LG cs.AI cs.IT math.IT math.OC math.ST stat.TH

Efficient compression of neural networks and datasets

Lukas Silvester Barth, Paulo von Petersenn

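Affiliations Max Planck Institute for Mathematics in the Sciences

AI summary This paper studies efficient compression of neural networks and datasets, combining algorithmic information theory with neural network pruning to improve model generalization under the minimum description length (MDL) principle. By using parameter sparsity as a computable approximation of model description length and refining sparse optimization algorithms, the authors achieve substantial model compression on image and text datasets while retaining high accuracy. Experiments also confirm the advantages of compressed models in sample efficiency and generalization, supporting the predictions of Solomonoff induction.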

Comments 10 pages plus appendix, 9 Figures, 6 Tables

Abstract

Compression and generalization are fundamentally related through Solomonoff induction and the minimum description length principle (MDL), which predict that simpler models generalize better when data arises from low-complexity distributions. In this article, we combine insights from algorithmic information theory and techniques from neural network pruning to improve model generalization by identifying the most effective data compression method. Since exact MDL optimization is intractable, we cast it as $\ell_0$ regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. To identify the best practical approach, we systematically compare and refine complementary sparse optimization methods. In particular, we improve probabilistic pruning through a procedure that does not require Monte Carlo sampling and refine smooth $\ell_0$ approximations with a binary search routine that reduces hyperparameter complexity. Across convolutional networks and transformers evaluated on image and text datasets, our refined methods improve upon their predecessors, achieve substantial model compression with minimal accuracy loss, and yield short data description lengths. Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better.

2505.17101 2026-05-14 cs.CL cs.LG physics.comp-ph

A quantitative analysis of semantic information in deep representations of text and images

Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio

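Affiliations Scuola Internazionale Superiore di Studi Avanzati (SISSA); Universitat Pompeu Fabra (UPF); Catalan Institute of Research and Advanced Studies (ICREA)

AI summary This paper studies how semantic information is distributed in deep representations of text and images, using the Information Imbalance to quantify how well one representation predicts another. It finds that semantic information is spread across many tokens and is most predictive in the middle layers of the model, consistently across several languages and models. It also reveals marked asymmetries in predictability between representations and identifies the layers with the strongest cross-modal predictability between visual and textual modalities, supporting the hypothesis that semantics converge across languages, modalities, and architectures.

Abstract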

It was recently observed that the representations of different models that process identical or semantically related inputs tend to align. We analyze this phenomenon using the Information Imbalance, an asymmetric rank-based measure that quantifies the capability of a representation to predict another, providing a proxy of the cross-entropy which can be computed efficiently in high-dimensional spaces. By measuring the Information Imbalance between representations generated by DeepSeek-V3 processing translations, we find that semantic information is spread across many tokens, and that semantic predictability is strongest in a set of central layers of the network, robust across six language pairs. We measure clear information asymmetries: English representations are systematically more predictive than those of other languages, and DeepSeek-V3 representations are more predictive of those in a smaller model such as Llama3-8b than the opposite. In the visual domain, we observe that semantic information concentrates in middle layers for autoregressive models and in final layers for encoder models, and these same layers yield the strongest cross-modal predictability with textual representations of image captions. Our results support the hypothesis of semantic convergence across languages, modalities, and architectures, while showing that directed predictability between representations varies strongly with layer-depth, model scale, and language.
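As a rough illustration of an asymmetric rank-based measure of this kind, the sketch below computes the Information Imbalance in its commonly used form, $\Delta(A \to B) = \frac{2}{N} \langle r^B \mid r^A = 1 \rangle$: for each point, find its nearest neighbor in space A and record that neighbor's distance rank in space B. The use of Euclidean distance, the function names, and the quadratic-time pure-Python implementation are assumptions of this sketch, not details taken from the paper.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def ranked_neighbors(points, i):
    """Indices of all other points, sorted from nearest to farthest from points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: sq_dist(points[i], points[j]))


def information_imbalance(rep_a, rep_b):
    """Estimate Delta(A -> B) for two representations of the same N items.

    Values near 0 mean that neighbors in A are also neighbors in B, so A is
    predictive of B; values near 1 mean that A carries little information
    about B. The measure is asymmetric: swapping the arguments can change it.
    """
    n = len(rep_a)
    total_rank = 0
    for i in range(n):
        nearest_in_a = ranked_neighbors(rep_a, i)[0]
        # Rank (1 = nearest) of that same item among i's neighbors in space B.
        total_rank += ranked_neighbors(rep_b, i).index(nearest_in_a) + 1
    return 2.0 * total_rank / (n * n)


if __name__ == "__main__":
    full = [[0.0, 0.0], [1.0, 0.2], [2.5, 0.1], [4.0, 3.0], [7.0, 1.0], [0.5, 5.0]]
    first_coord = [[p[0]] for p in full]  # keeps only part of the information
    print(information_imbalance(full, first_coord))
    print(information_imbalance(first_coord, full))
```

Comparing the two printed values shows the directionality: the measure from the full representation to its projection generally differs from the measure in the opposite direction.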

2505.12942 2026-05-14 cs.CL cs.AI cs.LG

A3: an Analytical Low-Rank Approximation Framework for Attention

Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, Christos-Savvas Bouganis, George A. Constantinides, Wayne Luk, Yiren Zhao

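Affiliations Department of Electrical and Electronic Engineering, Imperial College London

AI summary Large language models perform well, but their massive parameter counts make deployment expensive. This paper proposes $A^3$, a post-training low-rank approximation framework that splits a Transformer layer into three functional components and optimizes each analytically, reducing model parameters, KV cache size, and computation without introducing runtime overhead. Experiments show that $A^3$ preserves model performance better than existing methods; for example, under the same compute and memory reduction budget, its approximated LLaMA 3.1-70B achieves markedly better perplexity on WikiText-2 than the previous state of the art.

Abstract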

Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) they focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches and memory operations for decomposed small matrices. To address these limitations, we propose $A^3$, a post-training low-rank approximation framework. $A^3$ splits a Transformer layer into three functional components, namely $\texttt{QK}$, $\texttt{OV}$, and $\texttt{MLP}$, and provides analytical solutions that reduce the hidden dimension size inside each component while minimizing the component's functional loss. This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overhead. Through extensive experiments, we show that $A^3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also show versatile applications of $A^3$ in KV cache compression, integration with quantization, fine-tuning and mixed-rank assignments. We open-sourced our framework and code at https://github.com/DeepWok/a3.

2505.12415 2026-05-14 cs.CL cs.AI

Table-R1: Region-based Reinforcement Learning for Table Understanding

Zhenhe Wu, Jian Yang, Zhongjiang He, Changzai Pan, Jie Zhang, Jiaheng Liu, Xianjie Wu, Yu Zhao, Shuangyong Song, Yongxiang Li, Zhoujun Li, Xueling Li

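Affiliations Beihang University; Xingchen AGI Lab; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; Institute of Artificial Intelligence (TeleAI), China Telecom

AI summary Table understanding poses unique challenges for language models because structured row-column interactions call for specialized methods. This paper proposes Table-R1, a region-based reinforcement learning method that improves table question answering through region-enhanced supervised fine-tuning and table-aware group relative policy optimization. Combining textual, symbolic, and program-based reasoning, the method makes precise use of table region information and, in experiments on several benchmark datasets, clearly outperforms baseline models with more parameters.

Abstract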

Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.

2505.11556 2026-05-14 cs.CL cs.AI cs.MA

Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

Yuxuan Li, Aoi Naito, Hirokazu Shirado

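Affiliations School of Computer Science, Carnegie Mellon University, Pittsburgh, USA; School of Environment and Society, Institute of Science Tokyo, Tokyo, Japan

AI summary This study examines collective reasoning under distributed information in multi-agent systems built on large language models and finds systematic failures. It introduces HiddenBench, a benchmark that uses the Hidden Profile paradigm to isolate collective reasoning; multi-agent systems reach only 30.1% accuracy under distributed information, far below the 80.7% of single agents given complete information. The gap stems from agents failing to recognize and act on latent information asymmetry, so they converge prematurely on shared evidence and overlook critical distributed facts. The problem persists across models and strategies but is substantially mitigated by a structured communication protocol.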

Comments Accepted to ICML 2026

Abstract

Multi-agent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry -- they fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes -- and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. We further show that this bottleneck is actionable: a lightweight structured communication protocol substantially improves collective reasoning across model families. Our results identify failures in collective information exploration in decision-making as a key limitation of multi-agent LLMs, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.

2505.05376 2026-05-14 cs.CV

GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

Rachmadio Noval Lazuardi, Artem Sevastopolsky, Egor Zakharov, Matthias Niessner, Vanessa Sklyarova

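Affiliations Technical University of Munich; ETH Zürich; Max Planck Institute for Intelligent Systems

AI summary This paper proposes a method that reconstructs hair strands directly from colorless 3D scans using multi-modal hair orientation extraction. The method detects surface features in renderings of the scan with a neural network and combines them with a diffusion prior, reconstructing both simple and intricate hairstyles accurately from geometry alone. The authors also build Strands400, a dataset of hair strands reconstructed from 400 real-world scans, as a resource for training generative models and for computer graphics applications.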

Comments 15 pages, 9 figures, 1 table

Abstract

We propose a novel method that reconstructs hair strands directly from colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair strand reconstruction is a fundamental problem in computer vision and graphics, essential for high-fidelity digital avatar synthesis, animation, and AR/VR applications. However, accurately recovering hair strands from raw scan data remains challenging due to the complex and fine-grained structure of human hair, and none of the existing methods operate on colorless 3D geometry alone. To address this gap, our method directly identifies sharp surface features on the scan and estimates strand orientation using a neural 2D line detector applied to the renderings of scan shading. Additionally, we incorporate a diffusion prior trained on a diverse set of synthetic hair scans, refined with a noise schedule, and adapted to the reconstructed contents via a scan-specific text prompt. We demonstrate that this combination of supervision signals enables accurate reconstruction of both simple and intricate hairstyles from geometry alone. By enabling strand extraction from 3D scans, we compile Strands400, the largest publicly available dataset of hair strands with detailed surface geometry extracted from real-world data, comprising reconstructions from 400 subjects' scans. Strands400 enables training data-driven generative models for downstream tasks such as image-to-strands and text-to-strands. Moreover, our method applies to designer mesh assets, supporting a practical CG workflow where artists model hair as meshes and need strand-level representations for simulation and rendering. All code and data will be released for research purposes on https://seva100.github.io/GeomHair/.

2505.04152 2026-05-14 cs.CL cs.CY cs.HC

SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation

Manas Satish Bedmutha, Feng Chen, Andrea Hartzler, Trevor Cohen, Nadir Weibel

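Affiliations University of Washington; UC San Diego

AI summary This study examines whether large language models (LLMs) can detect 20 social behavior signals in clinical conversations without fine-tuning. Detection performance differs across models and prompting strategies and varies in particular with patient race and visit segment. In response, the authors propose an agreement-weighted ensemble based on group-level agreement, which improves detection accuracy and stability and offers a practical route to tracking social signals in clinical conversations at scale.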

Comments To be presented at CHIL 2026

Abstract

Effective patient-provider communication is difficult to assess at scale. We examine whether large language models (LLMs) can track 20 social behaviors from clinical transcripts without fine-tuning. Across three model families and multiple prompting strategies, LLMs reliably detect social signals, though performance varies by patient race and visit segment. To address this variability under query-only API constraints, we introduce an agreement-weighted ensemble using group-level agreement patterns. This approach improves both accuracy and stability over the best individual model, demonstrating a practical pathway for scalable social signal tracking in clinical conversations.

2504.14129 2026-05-14 cs.CV

PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao

Affiliations Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences); Shandong University of Science and Technology; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); School of Electronics and Information Engineering, Harbin Institute of Technology; School of Computer Science and Technology, Shandong University

AI summary This paper proposes a parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA), which traces forged faces back to unseen advanced generators such as diffusion models. By incorporating face parsing information, the method captures how generators differ in preserving source face attributes, improving the granularity and generalization of attribution. The work also builds a new zero-shot deepfake attribution benchmark and designs a contrastive center loss that further improves traceability to unknown generators; experiments show it outperforms state-of-the-art methods on the benchmark.

Comments Accepted to IEEE Transactions on Dependable and Secure Computing 2026

Abstract

The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in the vision modality, and other modalities such as text and face parsing are not fully explored. Besides, they tend to fail to assess, in a fine-grained manner, how well deepfake attributors generalize to unseen advanced generators such as diffusion models. In this paper, we propose a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA), which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we construct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors on unseen advanced generators such as diffusion models. Besides, we propose an innovative PVLM attributor based on the vision-language model to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We propose to exploit these inherent differences in facial attribute preservation to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state of the art on the ZS-DFA benchmark across various evaluation protocols.

2504.11944 2026-05-14 cs.LG cs.AI

VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

Xuyang Chen, Keyu Yan, Guojian Wang, Lin Zhao

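Affiliations Department of Electrical and Computer Engineering, National University of Singapore, Singapore

AI summary VIPO is a model-based offline reinforcement learning algorithm that targets the conservatism traditional methods introduce to compensate for model errors. It adds a value function inconsistency penalty, using self-supervised feedback to improve model training and hence model accuracy. Experiments show that VIPO performs strongly on multiple benchmarks and clearly outperforms existing methods, providing a general and efficient framework for improving model-based offline reinforcement learning.

Abstract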

Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the value estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. In particular, it achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks. Overall, VIPO offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy.

2503.19719 2026-05-14 cs.LG cs.AI cs.CV

On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

Francisco Mena, Diego Arenas, Miro Miranda, Andreas Dengel

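Affiliations University of Kaiserslautern-Landau (RPTU); German Research Center for Artificial Intelligence (DFKI)

AI summary This paper studies what determines the robustness of multi-source models to missing data in Earth observation. It evaluates the predictive performance of six state-of-the-art multi-source models when a single data source is missing or only one source is available, and finds that effectiveness depends closely on the nature of the task, the complementarity among data sources, and the model design. It also finds that removing certain data sources can improve predictive performance, which challenges the assumption that more data is always better and prompts reflection on model complexity and on whether every collected source is necessary.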

Comments Accepted at IEEE International Geoscience and Remote Sensing Symposium 2025

Journal ref 2025 IEEE International Geoscience and Remote Sensing Symposium

Abstract

In recent years, the development of robust multi-source models has emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in predicting scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially shaping the way for more streamlined approaches in EO applications.

2502.02270 2026-05-14 cs.LG math.OC stat.ML

Exact Sequence Interpolation with Transformers

Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua

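Affiliations Chair for Dynamics, Control, Machine Learning, and Numerics (Alexander von Humboldt Professorship), Friedrich–Alexander–Universität Erlangen–Nürnberg; Departamento de Matemáticas, Universidad Autónoma de Madrid; Chair of Computational Mathematics, Fundación Deusto

AI summary This paper studies the ability of transformers to interpolate finite input sequences, proving that they can exactly interpolate input sequences of arbitrary finite length in $\mathbb{R}^d$ together with their corresponding output sequences. By alternating feed-forward and self-attention layers and exploiting the clustering effect of self-attention, the authors construct a transformer whose parameter count is independent of the input sequence length. The construction also uses low-rank parameter matrices, as practical implementations do, extends the results from hardmax to softmax self-attention, and provides convergence guarantees under regularized training, offering a new perspective on the theoretical performance of transformers.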

Comments 36 pages, 9 figures. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558)

Abstract

We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models, offering an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.

2501.17443 2026-05-14 cs.LG

Gradual Domain Adaptation for Graph Learning

Pui Ieng Lei, Ximing Chen, Yijun Sheng, Yanyan Liu, Zhiguo Gong, Qiang Yang

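Affiliations Faculty of Science and Technology, University of Macau; PolyU Academy for Artificial Intelligence, Hong Kong Polytechnic University

AI summary This paper studies gradual domain adaptation for graph learning, targeting large distribution shifts between source and target graphs. It proposes a graph gradual domain adaptation (GGDA) framework based on the Fused Gromov-Wasserstein (FGW) metric, which generates knowledge-preserving intermediate graphs and builds a compact domain sequence to minimize information loss during adaptation. A vertex-based progression strategy improves inter-domain transferability, and implementable theoretical bounds guide the construction of the domain sequence. Experiments show strong performance across a range of transfer scenarios.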

Comments Accepted by ACM Trans. Intell. Syst. Technol. (https://doi.org/10.1145/3815185)

Abstract

Existing machine learning literature lacks graph-based domain adaptation techniques capable of handling large distribution shifts, primarily due to the difficulty in simulating a coherent evolutionary path from source to target graph. To meet this challenge, we present a graph gradual domain adaptation (GGDA) framework, which constructs a compact domain sequence that minimizes information loss during adaptation. Our approach starts with an efficient generation of knowledge-preserving intermediate graphs over the Fused Gromov-Wasserstein (FGW) metric. A GGDA domain sequence is then constructed upon this bridging data pool through a novel vertex-based progression, which involves selecting "close" vertices and performing adaptive domain advancement to enhance inter-domain transferability. Theoretically, our framework provides implementable upper and lower bounds for the intractable inter-domain Wasserstein distance, $W_p(\mu_t,\mu_{t+1})$, enabling its flexible adjustment for optimal domain formation. Extensive experiments across diverse transfer scenarios demonstrate the superior performance of our GGDA framework.

2412.06341 2026-05-14 cs.CV cs.AI

Visual Accommodation: Rethinking Image Scale as a Learnable Variable for Object Detection

Daeun Seo, Hoeseok Yang, Sihyeong Park, Hyungshin Kim

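Affiliations Chungnam National University; Santa Clara University; Korea Electronics Technology Institute

AI summary This paper proposes Ciliary-DETR, a framework that improves test-time adaptability in object detection by treating image scale as a learnable variable, analogous to accommodation in biological vision. It introduces a lightweight scale predictor that dynamically estimates the best test-time scale factor across different input scales, improving flexibility and robustness. A parametric scale-optimization objective addresses the fact that the optimal input scale is unobservable under standard training setups, enabling efficient single-pass inference.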

Comments 23 pages, 11 figures

Abstract

We propose Ciliary-DETR (previous name: Elastic-DETR), a framework for test-time resolution adjustment analogous to biological accommodation. While multi-scale data augmentation improves robustness to scale variation, modern detectors rely on fixed inference resolutions, potentially limiting flexibility and robustness. Similar to the ciliary muscle, we introduce a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The core challenge is that the optimal input scale is inherently unobservable under standard training setups. To address this challenge, we introduce a parametric formulation of desired scaling behavior, leading to loss-driven objectives that guide scale optimization. Overall, our method enables flexible and efficient single-pass inference, bridging the gap between training-time robustness and test-time adaptation.

2411.15913 2026-05-14 cs.SD cs.AI cs.LG eess.AS

Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Jungwoo Seo, Shinjae Yoo, Yuewei Lin, Jiook Cha

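Affiliations Seoul National University; Michigan State University; Rutgers University; Brookhaven National Laboratory

AI summary This work proposes Stylus, a training-free music style transfer method that repurposes pretrained image diffusion models to operate in the Mel-spectrogram domain. It treats audio as structured time-frequency images and manipulates self-attention by injecting style keys and values while preserving the source's structural queries, transferring style while keeping content structure. Experiments show that Stylus outperforms existing methods in both content preservation and perceptual quality, confirming that generic image priors are effective for training-free transformation of structured Mel-spectrograms.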

Comments Accepted by ICIP 2026

Abstract

Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality. Our work validates that generic image priors can be effectively leveraged for the training-free transformation of structured Mel-spectrograms. Code and materials are available at https://github.com/Sooyyoungg/Stylus.git.

2410.14375 2026-05-14 cs.LG cs.CL

Causal Fine-Tuning under Latent Confounded Shift

Jialin Yu, Yuxiang Zhou, Haoxuan Li, Junchi Yu, Mengyue Yang, Yulan He, Nevin L. Zhang, Philip Torr, Ricardo Silva

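Affiliations University of Oxford, United Kingdom; Queen Mary University of London, United Kingdom; Peking University, China; University of Bristol, United Kingdom; King's College London, United Kingdom; Hong Kong University of Science; University College London, United Kingdom

AI summary Latent confounded shift caused by hidden variables is a core challenge for adapting AI models in real-world settings. This paper proposes Causal Fine-Tuning (CFT), which uses a structural causal model as an inductive bias to decompose representations into stable and shift-sensitive components, making the model more robust to non-causal shortcuts. Experiments show that the method outperforms conventional domain generalization baselines on text tasks and improves performance under spurious correlation injection attacks.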

Comments ICML 2026 Camera Ready Version

Abstract

Adapting to latent confounded shift remains a core challenge in modern AI. This setting is driven by hidden variables that induce spurious correlations between inputs and outputs during training, leading models to rely on non-causal shortcuts. For example, a model may learn to treat metadata (e.g., data source like "Amazon") as a proxy for positive sentiment, causing failure when the source becomes predominantly negative during deployment. To address this latent confounded shift, we introduce Causal Fine-Tuning (CFT). Using a structural causal model as an inductive bias, we derive sufficient identification conditions that motivate a fine-tuning objective for decomposing representations into high-level stable and low-level shift-sensitive components. Instantiating this framework in BERT, we show that learning such causal/spurious representations and adjusting them accordingly yield a more robust predictor. Experiments on spurious correlation injection attacks in text demonstrate that our method outperforms black-box domain generalization baselines, highlighting the benefits of explicitly modeling causal structure.

2408.15621 2026-05-14 cs.LG cs.CR

Convergent Differential Privacy Analysis for General Federated Learning

Yan Sun, Qixin Zhang, Li Shen, Dacheng Tao

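Affiliations School of Computer Science, Faculty of Engineering, University of Sydney; Generative AI Lab, College of Computing and Data Science, Nanyang Technological University; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI summary This paper studies convergent privacy analysis for federated learning combined with differential privacy, noting that existing analyses yield overly loose upper bounds on privacy leakage when there are many communication rounds. Using the $f$-DP framework, the authors evaluate two classical methods in depth and, with the shifted interpolation technique, prove that privacy leakage in Noisy-FedAvg has a tight convergent bound while Noisy-FedProx has a stable constant lower bound, giving a solid theoretical basis for the reliability of privacy guarantees in federated learning.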

Abstract

The powerful cooperation of federated learning (FL) and differential privacy (DP) provides a promising paradigm for large-scale private clients. However, existing analyses in FL-DP mostly rely on the composition theorem, which is tight for a few communication rounds but eventually yields an arbitrarily loose and divergent bound, and therefore cannot tightly quantify privacy leakage. This also implies a counterintuitive judgment, suggesting that FL-DP may not provide adequate privacy support during long-term training under constant-level noisy perturbations, yielding a discrepancy between the theoretical and experimental results. To further investigate the convergent privacy and reliability of the FL-DP framework, in this paper, we comprehensively evaluate the worst-case privacy of two classical methods under non-convex and smooth objectives based on the $f$-DP analysis. With the aid of the shifted interpolation technique, we successfully prove that privacy in Noisy-FedAvg has a tight convergent bound. Moreover, with the regularization of the proxy term, privacy in Noisy-FedProx has a stable constant lower bound. Our analysis further demonstrates a solid theoretical foundation for the reliability of privacy in FL-DP. Meanwhile, our conclusions can also be losslessly converted to other classical DP analytical frameworks, e.g. $(\varepsilon,\delta)$-DP and Rényi-DP (RDP), to provide more fine-grained understandings for the FL-DP frameworks.

2408.12935 2026-05-14 cs.AI

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

Chen Chen, Xueluan Gong, Ziyao Liu, Weifeng Jiang, Si Qi Goh, Kwok-Yan Lam

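Affiliations College of Computing and Data Science

AI summary This paper examines key issues in AI Safety in the context of large language models (LLMs) and proposes a new framework that views AI safety from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. It reviews current research progress and challenges and, using examples from state-of-the-art technologies, summarizes innovative approaches to designing and testing AI safety, with the aim of advancing research in the field and strengthening people's trust in digital transformation.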

Abstract

AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.

2408.01119 2026-05-14 cs.CL

Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova

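Affiliations Faculty of Information Technology, Brno University of Technology; Kempelen Institute of Intelligent Technologies; German Research Center for Artificial Intelligence (DFKI)

AI summary This paper proposes Task Prompt Vectors, a new method for improving the modularity of soft-prompt tuning in multi-task settings. A task prompt vector is the element-wise difference between a tuned soft prompt and its random initialization, and the paper shows that these vectors can effectively initialize prompt tuning on similar tasks in low-resource settings. Experiments indicate that the vectors are independent of the random initialization across the model architectures tested and support prompt arithmetic across tasks, providing an efficient and competitive alternative for multi-task prompt tuning.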

Abstract

Prompt tuning is an efficient solution for training large language models (LLMs). However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created by taking the element-wise difference between the weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning on 2 different language model architectures. This allows prompt arithmetic with the pre-trained vectors from different tasks. In this way, we provide a competitive alternative to state-of-the-art baselines by arithmetic addition of task prompt vectors from multiple tasks.
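The element-wise difference and the arithmetic addition described in this abstract can be illustrated with a minimal sketch. It assumes a soft prompt is a small matrix of floats (tokens by embedding dimensions); the function names `task_prompt_vector` and `apply_task_vectors` and the toy values are hypothetical and are not taken from the paper or its code.

```python
def task_prompt_vector(tuned, init):
    """Element-wise difference between a tuned soft prompt and its initialization."""
    return [[t - i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(tuned, init)]


def apply_task_vectors(init, vectors):
    """Add one or more task prompt vectors to an initialization, element-wise."""
    result = [row[:] for row in init]  # copy so the initialization is not mutated
    for vec in vectors:
        for r, v_row in enumerate(vec):
            for c, v in enumerate(v_row):
                result[r][c] += v
    return result


# Toy prompts with 2 tokens and 3 dimensions (hypothetical values).
init = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
tuned_a = [[0.5, 1.0, 2.5], [1.0, 2.0, 1.0]]
tuned_b = [[0.0, 0.0, 2.0], [2.0, 1.0, 1.0]]

vec_a = task_prompt_vector(tuned_a, init)
vec_b = task_prompt_vector(tuned_b, init)

# Arithmetic addition of vectors from two tasks on top of one initialization.
combined = apply_task_vectors(init, [vec_a, vec_b])
```

Adding a single vector back to its own initialization recovers the tuned prompt; whether a sum of several vectors is a good starting point for a new task is the empirical question the abstract addresses.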

2407.15512 2026-05-14 cs.LG cs.AI cs.CV

Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation

Francisco Mena, Diego Arenas, Andreas Dengel

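Affiliations University of Kaiserslautern-Landau, Kaiserslautern, Germany; German Research Center for Artificial Intelligence, Kaiserslautern, Germany

AI summary This study aims to make multi-sensor machine learning models for Earth observation more robust to missing sensors. The authors propose two new methods, Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI), and validate them experimentally on three multi-sensor temporal datasets. They find that ensemble multi-sensor models are the most robust to missing sensors and that the sensor dropout mechanism in ISensD also shows good robustness.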

Comments Accepted at the MACLEAN workshop in the ECML/PKDD 2024

Journal ref Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2024

Abstract

Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address generalization to missing data. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
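The general idea of dropping whole sensors at the input during training can be sketched as follows. The abstract does not specify how ISensD masks inputs, so everything here is an assumption made for illustration: the zero-filling, the rule of always keeping at least one sensor, the function name `sensor_dropout`, and the sensor names.

```python
import random


def sensor_dropout(sample, drop_prob, rng):
    """Zero out whole sensors at random, always keeping at least one.

    sample maps a sensor name to a list of feature values. Zero-filling and
    the keep-at-least-one rule are illustrative assumptions.
    """
    names = list(sample)
    kept = [n for n in names if rng.random() >= drop_prob]
    if not kept:
        # Never present the model with an input that has no sensor at all.
        kept = [rng.choice(names)]
    masked = {
        n: list(sample[n]) if n in kept else [0.0] * len(sample[n])
        for n in names
    }
    return masked, kept


# Hypothetical three-sensor sample.
sample = {
    "optical": [0.2, 0.4, 0.6],
    "radar": [1.5, 1.1],
    "weather": [12.0],
}
rng = random.Random(0)  # seeded for reproducibility
masked, kept = sensor_dropout(sample, drop_prob=0.5, rng=rng)
```

Applying the function once per training sample exposes the model to many combinations of available sensors, which is the kind of variation a missing-sensor evaluation then probes.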

2407.01602 2026-05-14 cs.CL cs.LG math.DS stat.ML

Clustering in pure-attention hardmax transformers and its role in sentiment analysis

Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua

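Affiliations FAU

AI summary This paper studies the behavior of pure-attention transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity, showing that the inputs converge to a clustered equilibrium determined by special "leader" points. Viewing the transformer as a discrete-time dynamical system in Euclidean space and using a geometric interpretation based on hyperplane separation, the authors build an interpretable transformer for sentiment analysis that captures context by clustering meaningless words around meaningful leader words. The work gives a theoretical basis for understanding the mathematical properties of transformers and identifies the remaining gap between theoretical analysis and practical implementation.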

Comments 23 pages, 11 figures, 1 table. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558). Accompanying code available at: https://github.com/DCN-FAU-AvH/clustering-hardmax-transformers

Journal ref SIAM Journal on Mathematics of Data Science 7(3):1367-1393, 2025

Abstract

Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity. By viewing such transformers as discrete-time dynamical systems describing the evolution of points in a Euclidean space, and thanks to a geometric interpretation of the self-attention mechanism based on hyperplane separation, we show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders. We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model, which effectively captures 'context' by clustering meaningless words around leader words carrying the most meaning. Finally, we outline remaining challenges to bridge the gap between the mathematical analysis of transformers and their real-life implementation.
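For readers unfamiliar with hardmax self-attention, the sketch below shows the generic operation: each point attends only to the points that maximize its attention score and takes their average. It is a simplification under stated assumptions, not the paper's layer: the attention matrix is taken to be the identity, so the score is a plain inner product, and the residual connection and normalization sublayer mentioned in the abstract are omitted. The function name `hardmax_attention` is hypothetical.

```python
def hardmax_attention(points):
    """For each point, average the points that maximize the inner product with it.

    Assumes an identity attention matrix, no residual connection and no
    normalization; these are simplifications for illustration.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    outputs = []
    for x in points:
        scores = [dot(x, y) for y in points]
        best = max(scores)
        # Ties share the attention weight equally.
        winners = [y for y, s in zip(points, scores) if s == best]
        outputs.append([sum(col) / len(winners) for col in zip(*winners)])
    return outputs


points = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
attended = hardmax_attention(points)
```

In this toy input the first two points both attend to `[2.0, 0.0]`, while the third attends to itself, which gives a small picture of how points can collapse toward a few attractors when such a map is iterated.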

2406.05410 2026-05-14 cs.AI cs.CL

ChatSR: Multimodal Large Language Models for Scientific Formula Discovery

Yanjie Li, Lina Yu, Weijun Li, Min Wu, Liping Zhang, Jingyi Liu, Yusong Deng, Mingzhu Wan, Xin Ning

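Affiliations AnnLab, Institute of Semiconductors, Chinese Academy of Sciences, Beijing, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; Zhongguancun Academy, Beijing, China; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing 101408, China; College of Materials Science and Opto-Electronic Technology, University of Chinese Academy of Sciences, Beijing, 100049, China; School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China

AI summary Current multimodal large language models focus mainly on understanding and processing perceptual modalities such as images and videos, and remain weak at understanding scientific data. This work proposes ChatSR, a new multimodal large language model designed for scientific data understanding. It treats scientific data as a new modality analogous to visual content and, through carefully designed encoders and modality alignment mechanisms, maps it into a representation space that large language models can process, capturing the structural characteristics and underlying regularities of the data. Based on user-specified prior constraints and preferences, it automatically generates mathematical formulas that conform to domain knowledge, advancing the automation of scientific discovery. Experiments show that ChatSR achieves state-of-the-art performance on 13 datasets and exhibits promising zero-shot ability.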

Comments 14 pages

Abstract

Current multimodal large language models (MLLMs) are mainly focused on the understanding and processing of perceptual modalities such as images and videos, while their capability for scientific data understanding remains insufficient. To this end, we propose ChatSR, a novel multimodal large language model tailored for scientific data understanding. ChatSR treats scientific data as a new modality analogous to visual content and, through carefully designed encoders and modality alignment mechanisms, maps scientific data into a representation space that can be processed by large language models, enabling the model to grasp the structural characteristics and underlying regularities of scientific data. Building on this foundation, ChatSR further exploits the rich domain knowledge and strong reasoning abilities of large language models to emulate a knowledgeable human scientist: based on user-specified prior constraints and preferences expressed (such as requirements on periodicity, symmetry, etc.), it automatically generates mathematical formulas that not only accurately fit the observed data but also conform to domain priors, thereby characterizing the latent laws embodied in scientific data and promoting the automation of scientific discovery. Experiments on 13 datasets show that ChatSR achieves state-of-the-art performance on traditional symbolic regression benchmarks. In addition, ChatSR exhibits a promising zero-shot ability to understand and utilize types of prior knowledge that are not present in its training data.

2403.11247 2026-05-14 cs.CV cs.RO

Compact 3D Gaussian Splatting For Dense Visual SLAM

Tianchen Deng, Chang Nie, Shuhong Liu, Wenhua Wu, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang

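Affiliations Shanghai Jiao Tong University; Nanyang Technological University; The University of Tokyo; Harvard University; University of Cambridge

AI summary This paper proposes a compact 3D Gaussian Splatting SLAM system to address the high memory consumption and slow training caused by the large number of redundant Gaussian ellipsoids in existing methods. A sliding window-based masking strategy and a geometry codebook for compression reduce both the number of Gaussian ellipsoids and their parameter size. Experiments show that the method speeds up training and rendering while maintaining scene reconstruction quality.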

Comments Accepted by IJCV 2026

Abstract

Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches are built on a tremendous number of redundant 3D Gaussian ellipsoids, leading to high memory and storage costs, and slow training speed. To address this limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces the number and the parameter size of Gaussian ellipsoids. A sliding window-based masking strategy is first proposed to reduce the redundant ellipsoids. Then we observe that the covariance matrices (geometry) of most 3D Gaussian ellipsoids are extremely similar, which motivates a novel geometry codebook to compress 3D Gaussian geometric attributes, i.e., the parameters. Robust and accurate pose estimation is achieved by a global bundle adjustment method with reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering speed while maintaining the state-of-the-art (SOTA) quality of the scene representation.

2402.05576 2026-05-14 cs.LG

Tighter Learning Guarantees on Digital Computers via Concentration of Measure on Finite Spaces

Anastasis Kratsios, A. Martina Neuman, Gudmund Pammer

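Affiliations Department of Mathematics, McMaster University and The Vector Institute; Faculty of Mathematics, University of Vienna; Department of Mathematics, ETH Zürich

AI summary This paper studies the generalization of machine learning models implemented on digital computers. For inputs in a Euclidean space, it derives new generalization bounds suited to the resulting discrete learning problem. By introducing a geometric representation dimension $m$, the authors obtain a family of bounds depending on the sample size $N$ and on $m$ that give tighter estimates at practical sample sizes. The bounds rest on a non-asymptotic concentration-of-measure result for finite metric spaces.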

Abstract

Machine learning models with inputs in a Euclidean space $\mathbb{R}^d$, when implemented on digital computers, generalize, and their generalization gap converges to $0$ at a rate of $c/N^{1/2}$ with respect to the sample size $N$. However, the constant $c>0$ obtained through classical methods can be large in terms of the ambient dimension $d$ and machine precision, posing a challenge when $N$ is small to realistically large. In this paper, we derive a family of generalization bounds $\{c_m/N^{1/(2\vee m)}\}_{m=1}^{\infty}$ tailored for learning models on digital computers, which adapt to both the sample size $N$ and the so-called geometric representation dimension $m$ of the discrete learning problem. Adjusting the parameter $m$ according to $N$ results in significantly tighter generalization bounds for practical sample sizes $N$, while setting $m$ small maintains the optimal dimension-free worst-case rate of $\mathcal{O}(1/N^{1/2})$. Notably, $c_{m}\in \mathcal{O}(m^{1/2})$ for learning models on discretized Euclidean domains. Furthermore, our adaptive generalization bounds are formulated based on our new non-asymptotic result for concentration of measure in finite metric spaces, established by leveraging metric embedding arguments.
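The notation $2\vee m$ in the bound family denotes $\max(2, m)$, so the $N$-dependent factor is $N^{-1/\max(2,m)}$. The short sketch below evaluates only that factor. The constants $c_m$ are deliberately left out because the abstract gives only their growth order, so these numbers cannot be used to compare full bounds across values of $m$; the function name `rate` is hypothetical.

```python
def rate(n_samples, m):
    """N-dependent factor 1 / N^(1 / max(2, m)) of the bound, without c_m."""
    return n_samples ** (-1.0 / max(2, m))


# m = 1 and m = 2 share the dimension-free N^(-1/2) rate.
rate_m1 = rate(100, 1)
rate_m2 = rate(100, 2)

# A larger m gives a slower decay in N for this factor alone.
rate_m4 = rate(10_000, 4)
```

Since the factor decays more slowly as $m$ grows, any gain from a larger $m$ has to come from the constant $c_m$, which is consistent with the abstract's remark that classical constants can be large in the ambient dimension and machine precision.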