arXivDaily: arXiv daily academic digest, updated Monday through Friday
2507.04465 2026-05-12 cs.CV

Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions

Konstantinos Foteinos, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

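AI summary: This paper surveys research progress in deep-learning-based visual hand gesture recognition (VHGR), systematically covering the main methods, commonly used datasets, and evaluation metrics, with the aim of giving researchers a comprehensive reference guide. The survey is organized around four core VHGR questions; it analyzes current state-of-the-art methods, compares performance across tasks, and identifies the field's main challenges and future research directions.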

Comments Submitted to Neurocomputing. Rewritten abstract, due to limited space

Abstract

The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always-important field of visual hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the current state-of-the-art (SOTA). The current survey aims to fill this gap by presenting a comprehensive overview of this computer vision field. With a systematic research methodology and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to propose improvements. Specifically, this survey focuses on four fundamental questions: what are the main VHGR aspects, what are the current SOTA methods, what comparative insights can be drawn across methods and tasks, and which challenges shape future research. Starting with the methodology used to locate the related literature, the survey identifies and organizes the key VHGR approaches in a taxonomy-based format. The SOTA methods are grouped across three primary VHGR tasks: static, isolated dynamic and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. To support the experimental evaluation of future methods in the field, the study reviews commonly used datasets and presents the standard performance metrics. Our survey concludes by identifying the major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

2507.04277 2026-05-12 cs.CV

Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices

Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong

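AI summary: This paper proposes LiteIE, a lightweight low-light image enhancement framework aimed at real-time, efficient enhancement on mobile devices. The method does not require large-scale labeled data; it uses a backbone-agnostic feature extractor with only two convolutional layers and a parameter-free iterative restoration module, which significantly reduces computation and parameter count. Experiments show that LiteIE outperforms existing methods by 1.4 dB PSNR on the LOL dataset with only 0.07% of the parameters of comparable methods, and processes 4K images at 30 frames per second on a mobile processor, making it suitable for resource-constrained edge devices.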

Comments Accepted by ESWA

Abstract

Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image features enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. In experiments on the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.

2506.21095 2026-05-12 cs.LG cs.AI

FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale

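AI summary: Federated learning (FL) enables collaborative training while preserving privacy, but it introduces an "illusion of fairness": a global model looks fair on average at the server while discrimination persists at the client level. Existing fairness-enhancing FL methods usually mitigate bias for a single sensitive attribute and ignore two realistic, conflicting scenarios: attribute-bias and value-bias. To address this, the paper proposes FeDa4Fair, the first benchmarking framework for testing fairness methods under heterogeneous client bias. It comprises a library for generating tailored datasets, a standardized benchmark suite, and fairness evaluation functions, supporting more robust and reproducible fairness research in FL.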

Comments Accepted at ACM FAccT 2026

Abstract

Federated Learning (FL) enables collaborative training while preserving privacy, yet it introduces a critical challenge: the "illusion of fairness". A global model, usually evaluated on the server, appears fair on average while keeping persistent discrimination at the client level. Current fairness-enhancing FL solutions often fall short, as they typically mitigate biases for a single, usually binary, sensitive attribute, while ignoring two realistic and conflicting scenarios: attribute-bias (where clients are unfair toward different sensitive attributes) and value-bias (where clients exhibit conflicting biases toward different values of the same attribute). To support more robust and reproducible fairness research in FL, we introduce FeDa4Fair, the first benchmarking framework designed to stress-test fairness methods under these heterogeneous conditions. Our contributions are three-fold: (1) we introduce FeDa4Fair, a library designed to create datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release a benchmark suite generated by the FeDa4Fair library to standardize the evaluation of fair FL methods; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.
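
The "illusion of fairness" under value-bias can be illustrated with a toy calculation. The sketch below is not the FeDa4Fair API; the function name, the choice of demographic parity as the metric, and the two hypothetical clients are all assumptions made for illustration. Each client is maximally biased, but in opposite directions, so the pooled gap cancels out.

```python
def demographic_parity_gap(predictions, groups):
    """Absolute difference in positive-prediction rate between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        members = [p for p, s in zip(predictions, groups) if s == g]
        rates.append(sum(members) / len(members))
    return abs(rates[0] - rates[1])


# Hypothetical toy data: binary predictions and a binary sensitive attribute.
# Client A favors group 0, client B favors group 1 (value-bias).
client_a = {"pred": [1, 1, 1, 1, 0, 0, 0, 0], "group": [0, 0, 0, 0, 1, 1, 1, 1]}
client_b = {"pred": [0, 0, 0, 0, 1, 1, 1, 1], "group": [0, 0, 0, 0, 1, 1, 1, 1]}

# Server-side view: pool all samples and measure a single global gap.
pooled_pred = client_a["pred"] + client_b["pred"]
pooled_group = client_a["group"] + client_b["group"]
global_gap = demographic_parity_gap(pooled_pred, pooled_group)

# Client-side view: measure the gap separately on each client.
client_gaps = [
    demographic_parity_gap(c["pred"], c["group"]) for c in (client_a, client_b)
]

print(global_gap, client_gaps)
```

In this constructed case the pooled gap is zero while each client's gap is at its maximum, which is the kind of discrepancy a client-level evaluation is meant to expose.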

2506.12542 2026-05-12 cs.LG cs.AI cs.CV stat.ML

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

Ejafa Bassam, Dawei Zhu, Kaigui Bian

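AI summary: This paper proposes PLD, a choice-theoretic knowledge distillation method that interprets the teacher network's logits as class "worth" scores and builds a weighted list-wise ranking loss under the Plackett-Luce model. PLD directly optimizes a single teacher-optimal ranking, with the true label first and the remaining classes in descending order of teacher confidence, which yields a convex, translation-invariant surrogate loss. Experiments show consistent gains across multiple datasets, teacher-student pairs of different architectures, and a range of distillation objectives.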

Journal ref
Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 136090--136112 (2026)
Abstract

Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto approach to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it acts as an additional term to cross-entropy. This term has its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce "Plackett-Luce Distillation (PLD)", a weighted list-wise ranking loss. In PLD, the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single "teacher-optimal" ranking. The true label is placed first, followed by the remaining classes in descending teacher confidence. This process yields a convex and translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, across CIFAR-100, ImageNet-1K, and MS-COCO, PLD achieves consistent gains across diverse architectures and distillation objectives, including divergence-based, correlation-based, and feature-based methods, in both homogeneous and heterogeneous teacher-student pairs.
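
A weighted Plackett-Luce ranking loss of the kind described above might look like the following minimal sketch. It builds the "teacher-optimal" ranking (true label first, remaining classes by descending teacher logit) and sums the Plackett-Luce negative log-likelihood terms over ranking positions. The weighting scheme is an assumption: each position is weighted by the teacher's softmax probability for the class at that position, which is one reading of "weighting each ranked choice by its own confidence". The function names are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def teacher_optimal_ranking(teacher_logits, true_label):
    """True label first, then the other classes by descending teacher logit."""
    rest = [c for c in range(len(teacher_logits)) if c != true_label]
    rest.sort(key=lambda c: teacher_logits[c], reverse=True)
    return [true_label] + rest


def pld_loss(student_logits, teacher_logits, true_label):
    ranking = teacher_optimal_ranking(teacher_logits, true_label)
    weights = softmax(teacher_logits)  # assumption: teacher confidence as weight
    loss = 0.0
    for k, cls in enumerate(ranking):
        # Plackett-Luce term: choose `cls` among the classes not yet ranked.
        remaining = [student_logits[c] for c in ranking[k:]]
        loss += weights[cls] * (logsumexp(remaining) - student_logits[cls])
    return loss


teacher = [2.0, 0.5, 1.0, -1.0]
student = [0.3, -0.2, 0.9, 0.1]
print(teacher_optimal_ranking(teacher, true_label=2))
print(pld_loss(student, teacher, true_label=2))
```

The first term of the sum (position one, the true label against all classes) is a weighted cross-entropy, and adding a constant to every student logit leaves each term unchanged, which matches the translation invariance stated in the abstract.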

2506.07436 2026-05-12 cs.CV cs.AI cs.ET

Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

Nishi Chaudhary, S M Jamil Uddin, Sathvik Sharath Chandra, Anto Ovid, Alex Albert

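AI summary: This paper compares five state-of-the-art multimodal large language models on construction-site hazard recognition and examines how prompting strategy affects performance. Using zero-shot, few-shot, and chain-of-thought (CoT) prompting, the study finds that CoT significantly improves recognition accuracy and that models behave differently under different conditions, with GPT-4.5 and GPT-o3 performing best in most settings. The study highlights the key role of prompt design in safety applications of multimodal LLMs and offers practical guidance for building more reliable AI-assisted safety systems.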

Abstract

The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.
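
For readers unfamiliar with the metrics, precision, recall, and F1 for a single image can be computed from the set of hazards a model reports and the set annotated as ground truth. The sketch below is a generic illustration; the hazard labels are hypothetical and the abstract does not say how the study matched model output to annotations.

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 for one image."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for one construction-site image.
ground_truth = {"fall from height", "electrical", "struck-by"}
model_output = {"fall from height", "electrical", "chemical", "noise"}

precision, recall, f1 = precision_recall_f1(model_output, ground_truth)
print(precision, recall, f1)
```

Here two of the four reported hazards are correct and two of the three annotated hazards are found, so precision and recall penalize over-reporting and under-reporting separately.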

2506.01352 2026-05-12 cs.LG

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kai Chen, Kun Yuan, Binhang Yuan

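AI summary: In distributed training of large language models, network communication under pipeline parallelism is a bottleneck that limits training efficiency. This paper proposes TAH-Quant, an activation quantization framework that uses Hadamard-transform-based, tile-wise adaptive quantization to reduce the communication cost of intermediate activations while preserving training convergence. Experiments show that TAH-Quant quantizes activations to 3-4 bits without sacrificing model performance and improves throughput and wall-clock training speed over uncompressed FP32 and AQ-SGD.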

Abstract

Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants, but is often bottlenecked by network communication, particularly under pipeline parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided tile-wise adaptive bit allocation for optimal bit usage, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. Compared with token-level allocation, the tile-wise allocator assigns precision at the granularity of small channel windows within each token, reducing quantization error under the same bit budget. We prove that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of O(1/sqrt(T)), matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant achieves an aggressive activation quantization ratio of 3-4 bits, providing up to 4.3x throughput speedup over uncompressed FP32 and up to 1.33x wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.
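
The combination of a Hadamard rotation with tile-wise quantization can be sketched as follows. This is a simplified illustration under stated assumptions, not TAH-Quant itself: it uses a fixed bit width for every tile, plain min-max uniform quantization, and an orthonormal Walsh-Hadamard transform, and it omits the entropy-guided bit allocation and pivot swapping that the abstract describes. All function names are hypothetical.

```python
import math


def hadamard(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_tile(tile, bits):
    """Uniform min-max quantization of one tile to 2**bits levels."""
    lo, hi = min(tile), max(tile)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in tile]
    return codes, lo, step


def dequantize_tile(codes, lo, step):
    return [lo + c * step for c in codes]


def compress_and_restore(activation, tile_size, bits):
    """Rotate, quantize, dequantize, and rotate back, one tile at a time."""
    restored = []
    for start in range(0, len(activation), tile_size):
        rotated = hadamard(activation[start:start + tile_size])
        codes, lo, step = quantize_tile(rotated, bits)
        restored.extend(hadamard(dequantize_tile(codes, lo, step)))
    return restored


# Hypothetical activation vector with one large outlier.
activation = [0.1, -0.2, 0.05, 8.0, 0.3, -0.1, 0.2, 0.0]
restored = compress_and_restore(activation, tile_size=4, bits=4)
print(max(abs(a - r) for a, r in zip(activation, restored)))
```

In a pipeline-parallel setting only the integer codes plus the per-tile offset and step would be sent over the network; the receiving stage would dequantize and apply the inverse rotation.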

2506.01301 2026-05-12 cs.AI cs.CL

Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Dae Houlihan, Soroush Vosoughi, Shao-Yuan Lo

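AI summary: This study addresses multi-step complexity in multimodal Theory-of-Mind (ToM) reasoning by proposing a scalable Bayesian planner that decomposes ToM reasoning into stepwise Bayesian updates. A weak-to-strong control mechanism lets smaller language models specialize in ToM-specific likelihood estimation and transfers their reasoning behavior to larger language models, which integrate social and world knowledge. Experiments show a 4.6% accuracy improvement over existing techniques on multimodal ToM benchmarks, including challenging unseen scenarios.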

Comments Accepted as a Spotlight at the 2025 Forty-Second International Conference on Machine Learning (ICML 2025)

Abstract

Theory-of-Mind (ToM) enables humans to infer mental states, such as beliefs, desires, and intentions, forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments show that our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks, including challenging unseen scenarios, thereby establishing a new standard for modeling human mental states in complex environments.
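
A stepwise Bayesian update over candidate mental states can be sketched in a few lines. The hypotheses and likelihood values below are hypothetical; in the framework described above the per-step likelihoods would come from a language model, which this sketch replaces with fixed numbers.

```python
def bayes_update(prior, likelihood):
    """One Bayesian step: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical goal hypotheses with a uniform prior.
belief = {"wants_coffee": 0.5, "wants_tea": 0.5}

# Hypothetical likelihood of each observed action under each hypothesis,
# one dictionary per time step.
step_likelihoods = [
    {"wants_coffee": 0.8, "wants_tea": 0.2},  # walks toward the coffee machine
    {"wants_coffee": 0.6, "wants_tea": 0.4},  # picks up a mug
]

for likelihood in step_likelihoods:
    belief = bayes_update(belief, likelihood)

print(belief)
```

Because each step only needs the previous posterior and the current likelihood, the inference cost grows linearly with the number of observed steps.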

2505.16741 2026-05-12 cs.LG math.OC stat.ML

Meta-reinforcement learning with minimum attention

Shashank Gupta, Pilhwa Lee

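AI summary: This paper applies the minimum attention principle to reinforcement learning by adding a minimum-attention regularizer to the reward, aiming to improve learning efficiency and stability in high-dimensional nonlinear dynamics. Within a model-based meta-learning framework that alternates model learning and meta-policy optimization, experiments show that the method outperforms state-of-the-art algorithms in few-shot adaptation and in robustness to model and environment perturbations, and that it also improves energy efficiency.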

Comments 30 pages, 22 figures

Abstract

Minimum attention, first proposed by Brockett, applies the least action principle to changes of control with respect to state and time. The associated regularization is highly relevant to emulating biological control, such as motor learning. We apply minimum attention in reinforcement learning (RL) as part of the rewards and investigate its connection to meta-learning and stabilization. Specifically, model-based meta-learning with minimum attention is explored in high-dimensional nonlinear dynamics. Ensemble-based model learning and gradient-based meta-policy learning are alternately performed. Empirically, minimum attention outperforms state-of-the-art model-free and model-based RL algorithms, showing fast adaptation in few shots and variance reduction under perturbations of the model and environment. Furthermore, minimum attention demonstrates an improvement in energy efficiency.
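
One way to read "minimum attention as part of the rewards" is as a penalty on how quickly the control changes with state and with time. The sketch below assumes a scalar state, a scalar control, central finite differences, and a penalty equal to the sum of the squared partial derivatives; the function names and the penalty weight are hypothetical, and the paper's exact regularizer may differ.

```python
def attention_penalty(policy, x, t, eps=1e-4):
    """Squared sensitivity of the control u = policy(x, t) to state and time."""
    du_dx = (policy(x + eps, t) - policy(x - eps, t)) / (2 * eps)
    du_dt = (policy(x, t + eps) - policy(x, t - eps)) / (2 * eps)
    return du_dx ** 2 + du_dt ** 2


def shaped_reward(task_reward, policy, x, t, weight=0.1):
    """Task reward minus a weighted minimum-attention penalty (weight assumed)."""
    return task_reward - weight * attention_penalty(policy, x, t)


def feedback_policy(x, t):
    # Control that reacts to both state and time.
    return 2.0 * x + 3.0 * t


def constant_policy(x, t):
    # Control that ignores state and time, so it needs no "attention".
    return 0.5


print(attention_penalty(feedback_policy, x=1.0, t=0.0))
print(shaped_reward(1.0, constant_policy, x=1.0, t=0.0))
```

A constant control incurs no penalty, while a control that varies strongly with state or time is discouraged in proportion to the squared rate of change.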

2505.16025 2026-05-12 cs.CV cs.MM eess.IV

Context and Pixel Aware Large Language Model for Video Quality Assessment

Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang

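AI summary: Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional models focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models are insensitive to small distortions or treat scoring and description as separate tasks. To address both problems, this paper proposes CP-LLM, a context- and pixel-aware large language model. It uses dual vision encoders to analyze perceptual quality separately at the level of high-level video semantics and low-level pixel distortions, and a language decoder that reasons over both, producing robust quality scores and interpretable quality descriptions. Experiments show strong results on multiple benchmarks, with notably improved sensitivity to pixel distortions.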

Comments Accepted to ICIP 2026

Abstract

Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experiment results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.

2505.15879 2026-05-12 cs.CV cs.AI cs.CL

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang

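AI summary: This paper proposes GRIT, a method for training multimodal large language models (MLLMs) to reason with images in vision-language tasks. GRIT introduces a grounded reasoning paradigm in which the model interleaves natural language and explicit bounding-box coordinates while generating its reasoning chain, so that visual information is integrated explicitly. Combined with the reinforcement learning approach GRPO-GR, GRIT trains without reasoning-chain annotations or bounding-box labels and needs only a small amount of data to substantially improve the model's ability to produce coherent, visually grounded reasoning chains.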

Journal ref
NeurIPS 2025
Abstract

Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
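
A reward that checks only the final answer and the format of the grounded reasoning, with no box-level supervision, could be sketched as below. Everything specific here is an assumption: the `[x1, y1, x2, y2]` text format for boxes, the exact-match answer check, and the equal weighting of the two rewards are illustrative choices, not the GRPO-GR definitions.

```python
import re

# Assumed textual box format: [x1, y1, x2, y2] with integer pixel coordinates.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(reasoning):
    """Return every bounding box mentioned in the reasoning text."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(reasoning)]


def format_reward(reasoning):
    """1.0 if the reasoning cites at least one box and all boxes are well formed."""
    boxes = extract_boxes(reasoning)
    well_formed = [b for b in boxes if b[0] < b[2] and b[1] < b[3]]
    return 1.0 if boxes and len(well_formed) == len(boxes) else 0.0


def answer_reward(predicted, reference):
    """1.0 on a case-insensitive exact match of the final answer."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def total_reward(reasoning, predicted, reference):
    return answer_reward(predicted, reference) + format_reward(reasoning)


sample = "The sign at [10, 20, 110, 80] is red and octagonal, so it says stop."
print(extract_boxes(sample))
print(total_reward(sample, "Stop", "stop"))
```

Neither reward needs annotated boxes or reasoning chains: the format check inspects only the model's own output, and the answer check needs only the question-answer pair.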

2505.12437 2026-05-12 cs.LG cs.AI

A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring

Michele Fontanesi, Alessio Micheli, Marco Podda, Domenico Tortorella

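AI summary: This paper proposes a method for systematically generating graph explainability (graph XAI) benchmarks, motivated by the opacity of graph neural network (GNN) decisions. It uses the Weisfeiler-Leman color refinement algorithm to build benchmarks automatically from generic graph classification datasets, mining class-discriminating subgraph motifs as proxy ground-truth explanations and ensuring that these motifs are learnable by GNNs. The method yields the OpenGraphXAI suite of 15 datasets, along with code to generate more than 2,000 additional benchmarks, providing a broader and more reproducible testbed for evaluating graph explainers.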

Journal ref
Data Mining and Knowledge Discovery, vol. 40(4), article no. 42 (2026)
Abstract

Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
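
The Weisfeiler-Leman color refinement step that the method builds on is short enough to show directly. The sketch below implements standard 1-WL refinement on an adjacency-list graph with integer node labels; it covers only the refinement itself, not the motif mining or benchmark construction described in the abstract.

```python
def wl_refine(adjacency, labels, iterations):
    """Run 1-WL color refinement and return the final color of each node.

    adjacency: dict mapping node -> list of neighbor nodes
    labels: dict mapping node -> initial integer color
    """
    colors = dict(labels)
    for _ in range(iterations):
        # A node's signature is its own color plus the sorted multiset of
        # its neighbors' colors.
        signatures = {
            node: (colors[node], tuple(sorted(colors[n] for n in adjacency[node])))
            for node in adjacency
        }
        # Compress each distinct signature into a fresh integer color.
        palette = {sig: idx for idx, sig in enumerate(sorted(set(signatures.values())))}
        colors = {node: palette[signatures[node]] for node in adjacency}
    return colors


# Path graph 0 - 1 - 2 - 3 with identical initial labels.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
initial = {node: 0 for node in path}
print(wl_refine(path, initial, iterations=2))
```

On the path graph the two endpoints end up with one color and the two interior nodes with another, because refinement separates nodes whose neighborhoods differ. Nodes that share a color after k rounds have indistinguishable k-hop neighborhoods under 1-WL, which is what makes the colors usable for approximate subgraph matching.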

2505.10872 2026-05-12 cs.RO cs.AI cs.CL

REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

Chenxi Jiang, Chuhao Zhou, Jianfei Yang

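AI summary: This paper proposes REI-Bench, a benchmark for evaluating embodied agents' task planning under human instructions that contain vague referring expressions (REs). The study finds that such vagueness significantly degrades robot task planning, with success rates dropping by up to 36.9%. To address this, the authors propose task-oriented context cognition, which generates clear instructions and effectively improves planning performance, making robots easier to use for non-expert users such as the elderly and children.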

Comments Accepted at ICLR 2026

Abstract

Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve amazing performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, who are the groups that robots should serve more. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark that systematically models vague REs grounded in pragmatic theory (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 36.9%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompts, chains of thought, and in-context learning. By tackling the overlooked issue of vagueness, this work contributes to the research community by advancing real-world task planning and making robots more accessible to non-expert users, e.g., the elderly and children.

2505.02184 2026-05-12 cs.AI cs.DC cs.PL cs.SE

Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

Matthew T. Dearing, Yiheng Tao, Xingfu Wu, Zhiling Lan, Valerie Taylor

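AI summary: This paper studies whether large language models (LLMs), guided by feedback, can automatically generate energy-efficient parallel scientific codes. It proposes LASSI-EE, an automated refactoring framework that iteratively optimizes code by combining runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent. Experiments on two GPU platforms show average energy reductions of 36% on the AMD MI100 and 34% on the NVIDIA A100.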

Comments 12 pages, 5 figures, version under review at a peer-reviewed conference

Abstract

Large language models (LLMs) are increasingly used for generating parallel scientific codes, with a primary focus on generating functionally correct code. Recent work has focused on generating performant code, with an emphasis on its execution time. However, energy efficiency is now recognized as a critical objective, given the significant power demands of large-scale compute systems. This paper addresses the research question of whether LLMs can generate energy-efficient parallel scientific codes when guided by empirical execution feedback. To answer this question, we propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel codes through a multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent for screening generated code. We evaluate LASSI-EE using twenty-two representative scientific benchmarks and applications on NVIDIA A100 and AMD MI100 GPUs. The results indicate an average energy reduction of 36% for MI100 and 34% for A100, across trials that produced passing energy-reducing refactorings.
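
The energy figures above rest on turning sampled power readings into energy. A minimal sketch of that arithmetic, assuming timestamped power samples in watts and trapezoidal integration, is shown below; the sample values are hypothetical and the abstract does not describe how LASSI-EE collects or integrates its power profile.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_reduction_percent(baseline_j, refactored_j):
    """Relative energy saving of the refactored code versus the baseline."""
    return 100.0 * (baseline_j - refactored_j) / baseline_j


# Hypothetical power traces for one benchmark before and after refactoring.
baseline = energy_joules([0.0, 1.0, 2.0, 3.0], [200.0, 250.0, 250.0, 200.0])
refactored = energy_joules([0.0, 1.0, 2.0], [180.0, 220.0, 180.0])

print(baseline, refactored, energy_reduction_percent(baseline, refactored))
```

Because energy is power integrated over time, a refactoring can save energy by running faster, by drawing less power, or both, which is why execution time alone is not a sufficient proxy.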

2504.14697 2026-05-12 cs.LG math.AP math.DS stat.ML

Quantitative Clustering in Mean-Field Transformer Models

Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

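AI summary: This paper studies the long-time clustering of tokens in mean-field transformer models and shows that, under suitable assumptions on the model parameters, the model converges exponentially fast to a Dirac point mass. The quantitative analysis gives explicit convergence rates, providing a theoretical basis for understanding synchronization in transformer models.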

Comments 50 pages, 4 figures; We have updated the introduction and added sketches of the proofs of the main theorems

Abstract

The evolution of tokens through deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, under suitable assumptions on the transformer model parameters, we establish that any suitably regular mean-field initialization synchronizes exponentially fast to a Dirac point mass, with explicit quantitative convergence rates.
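
The clustering behavior can be observed numerically in a small finite-particle analogue. The sketch below is an illustration under assumptions, not the paper's model: tokens live on the unit sphere in three dimensions, each token drifts toward an attention-weighted average of all tokens with weights exp(beta * inner product), the drift is projected onto the tangent plane, and time is discretized with explicit Euler steps. The particle count, beta, step size, and the choice to start all tokens in the positive orthant are hypothetical.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_pairwise_cosine(points):
    """Smallest inner product between any two tokens (1.0 means one cluster)."""
    return min(
        dot(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )


def attention_step(points, beta, dt):
    """One Euler step of attention-driven dynamics on the unit sphere."""
    n = len(points)
    updated = []
    for x in points:
        weights = [math.exp(beta * dot(x, y)) for y in points]
        drift = [
            sum(w * y[k] for w, y in zip(weights, points)) / n
            for k in range(len(x))
        ]
        # Remove the radial component so the token stays on the sphere.
        radial = dot(drift, x)
        tangent = [d - radial * c for d, c in zip(drift, x)]
        updated.append(normalize([c + dt * t for c, t in zip(x, tangent)]))
    return updated


rng = random.Random(0)
tokens = [normalize([abs(rng.gauss(0, 1)) + 1e-3 for _ in range(3)]) for _ in range(8)]
start = min_pairwise_cosine(tokens)
for _ in range(2000):
    tokens = attention_step(tokens, beta=1.0, dt=0.05)
end = min_pairwise_cosine(tokens)
print(start, end)
```

Tracking the minimum pairwise inner product gives a simple scalar measure of synchronization: it approaches 1 as the tokens collapse toward a single point, the discrete counterpart of convergence to a Dirac mass.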

2504.14044 2026-05-12 cs.AI cs.CR

Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy

Regan Bolton, Mohammadreza Sheikhfathollahi, Simon Parkinson, Dan Basher, Howard Parkinson

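AI summary: This paper studies how large language models (LLMs) and multi-stage retrieval can improve the efficiency and accuracy of operational technology cybersecurity (OTCS) compliance verification for critical infrastructure such as railways. It proposes a compliance architecture based on multi-stage retrieval that adds context from regulatory standards and significantly improves the correctness and reasoning quality of compliance judgments. Experiments on standards such as IEC 62443 and IEC 63452 show clear gains over the baseline, offering an effective compliance assessment tool for an industry short of cybersecurity expertise.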

Abstract

Operational Technology Cybersecurity (OTCS) continues to be a dominant challenge for critical infrastructure such as railways. As these systems become increasingly vulnerable to malicious attacks due to digitalization, effective documentation and compliance processes are essential to protect these safety-critical systems. This paper proposes a novel system that leverages Large Language Models (LLMs) and multi-stage retrieval to enhance the compliance verification process against standards like IEC 62443 and the rail-specific IEC 63452. We first evaluate a Baseline Compliance Architecture (BCA) for answering OTCS compliance queries, then develop an extended approach called Parallel Compliance Architecture (PCA) that incorporates additional context from regulatory standards. Through empirical evaluation comparing OpenAI-gpt-4o and Claude-3.5-haiku models in these architectures, we demonstrate that the PCA significantly improves both correctness and reasoning quality in compliance verification. Our research establishes metrics for response correctness, logical reasoning, and hallucination detection, highlighting the strengths and limitations of using LLMs for compliance verification in railway cybersecurity. The results suggest that retrieval-augmented approaches can significantly improve the efficiency and accuracy of compliance assessments, particularly valuable in an industry facing a shortage of cybersecurity expertise.

2504.12501 2026-05-12 cs.LG

Reinforcement Learning from Human Feedback

Nathan Lambert

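AI summary: The book Reinforcement Learning from Human Feedback gives a systematic, gentle introduction to the core methods of RLHF for readers with a quantitative background. Starting from the origins of RLHF, it covers problem formulation, data collection, and mathematical foundations, then details the main optimization stages: instruction tuning, reward model training, rejection sampling, reinforcement learning, and direct alignment algorithms. It closes with understudied research questions in synthetic data and evaluation and with open questions for the field.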

Comments 229 pages. Web-native version at https://rlhfbook.com/ Continually improving, latest version at website

Abstract

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.

2503.18273 2026-05-12 cs.LG

Decoding Islamophobic Discourse: Using LLMs to Identify Tropes and Semi-Coded Hate Speech

Raza Ul Mustafa, Roi Dupart, Gabrielle Smith, Noman Ashraf, Nathalie Japkowicz

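AI summary: This paper studies Islamophobic discourse, which is growing in Western societies, by analyzing semi-coded terms (such as muzrat and pislam) on extremist social platforms to identify the hate speech they carry. Using large language models (LLMs) and BERT topic modeling, the study shows that these terms are hateful in specific contexts and finds that Islamophobic content receives higher toxicity scores than other kinds of hate speech. Although LLMs understand these out-of-vocabulary slurs, current moderation strategies and algorithmic detection need further improvement to counter such discourse effectively.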

Abstract

In recent years, Islamophobia has gained significant traction across Western societies, fueled by the rise of digital communication networks. This paper performs a large-scale analysis of specialized, semi-coded Islamophobic terms (such as muzrat, pislam, mudslime, mohammedan, and muzzies) circulated on extremist social platforms, e.g., 4Chan, Gab, and Telegram. Many of these terms appear lexically neutral or ambiguous outside of specific contexts, making them difficult for both human moderators and automated systems to reliably identify as hate speech. First, we use Large Language Models (LLMs) to show their ability to understand these terms. Second, the Google Perspective API suggests that Islamophobic posts tend to receive higher toxicity scores than other categories of hate speech such as antisemitism. Finally, we use a BERT topic modeling approach to extract different topics and Islamophobic discourse on these social platforms. Our findings indicate that LLMs understand these Out-Of-Vocabulary (OOV) slurs; however, further improvements in moderation strategies and algorithmic detection are necessary to address such discourse effectively. Our topic modeling also indicates that Islamophobic text is found across various political, conspiratorial, and far-right movements and is particularly directed against Muslim immigrants. Taken together, this is one of the first studies of Islamophobic semi-coded terms and sheds a global light on Islamophobia.

2503.12333 2026-05-12 cs.RO cs.MA

GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Vagul Mahadevan, Shangtong Zhang, Rohan Chandra

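AI summary: Safe, agile, and socially compliant multi-agent navigation in cluttered and constrained environments remains a major challenge, especially in decentralized settings where agents have different, unknown priorities and there is no central coordinator. The paper proposes GameChat, in which agents communicate in natural language to resolve conflicts on their own and navigate efficiently. Experiments show that the method substantially improves navigation efficiency and the rate at which the higher-priority agent reaches its goal first across several scenarios, and that it extends to more than two agents.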

Journal ref
2025 IEEE International Symposium on Multi-Robot and Multi-Agent Systems (MRS), 2025, pp. 1-7
Abstract

Safe, agile, and socially compliant multi-robot navigation in cluttered and constrained environments remains a critical challenge. This is especially difficult with self-interested agents with unique, unknown priorities in decentralized settings, where there is no central authority to resolve conflicts induced by spatial symmetry. We address this challenge by proposing an intuitive, but very effective approach, GameChat, which facilitates safe, agile, and deadlock-free navigation for both cooperative and self-interested agents in cluttered environments. Key to our approach is the idea that agents should resolve conflicts on their own using natural language to communicate, much like humans. We evaluate GameChat in simulated environments with doorways and intersections. The results show that even in the worst case, GameChat reduces the time for all agents to reach their goals by over 35% from a naive baseline and by over 20% from a state of the art baseline in the intersection scenario, while doubling the rate of ensuring the agent with a higher priority task reaches the goal first, from 50% (equivalent to random chance) to 100%. We also demonstrate how GameChat can be extended to more than two agents.

2503.09158 2026-05-12 cs.CV

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Fufangchen Zhao, Songbai Tan, Xuerui Qiu, Linrui Xun, Wenhao Jiang, Jinkai Zheng, Hehe Fan, Jian Gao, Danfeng Yan, Ming Li

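AI summary Existing video large language models often fail to capture the subtle, question-relevant facial cues needed for facial video understanding. This paper proposes FaVChat, a video understanding model with hierarchical prompt guidance that emphasizes question-relevant information at three complementary levels, improving reasoning over subtle facial features. The work also introduces Data-Efficient GRPO, a reinforcement learning strategy for improving performance when data is scarce, and builds the FaVChat 170K benchmark dataset of about 60K high-quality facial videos and 170K question-answer pairs. Experiments show that the method outperforms existing models on several facial understanding tasks.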

Details
English abstract

Existing video large language models (VLLMs) primarily leverage prompt-agnostic visual encoders, which extract untargeted facial representations without awareness of the queried information, leading to the loss of task-critical cues. To address this challenge, we propose FaVChat, the first VLLM designed for reasoning over subtle visual and dynamic facial cues. FaVChat introduces a hierarchical, prompt-guided visual feature extraction framework that emphasizes question-relevant information at three complementary levels. These multi-level features are dynamically fused and injected into the LLM, enabling more accurate reasoning about facial details. To further improve learning efficiency under data scarcity, we propose Data-Efficient GRPO, a reinforcement learning strategy that iteratively identifies high-utility samples and maximizes the contribution of each instance via per-instance utility estimation, substantially enhancing performance gains under limited supervision. We construct a large-scale benchmark dataset, FaVChat 170K, comprising approximately 60K high-quality facial videos and 170K question-answer pairs focusing on fine-grained facial details. Extensive experiments, including zero-shot evaluations on four facial understanding tasks, demonstrate that FaVChat consistently outperforms existing VLLMs.

2503.06047 2026-05-12 cs.AI cs.CL

DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Liquan Xiao

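AI summary DSGBench is a diverse strategic game benchmark for evaluating LLM-based agents in complex decision-making environments. It incorporates six complex strategic games that demand long-term, multi-dimensional decision-making and supports task customization with different difficulty levels and targets. DSGBench uses a fine-grained scoring system that assesses agents' decision-making along five specific dimensions, and adds an automated decision-tracking mechanism for in-depth analysis of agent behaviour patterns and strategic turning points, offering a useful reference for model selection and future agent development.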

Comments 43 pages, 5 figures, conference

Details
English abstract

Large language model (LLM)-based agents are increasingly applied to complex strategic environments that demand long-horizon reasoning, multi-agent interaction, and decision-making under uncertainty. However, common existing benchmarks either assess isolated skills, lack environmental diversity, or rely on broad overall metrics. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making tasks. Firstly, it incorporates six complex strategic games which serve as ideal testbeds due to their long-term and multi-dimensional decision-making demands and flexibility in customizing tasks with various difficulty levels and targets. Secondly, DSGBench employs a fine-grained evaluation scoring system which examines the decision-making capabilities by looking into the performance in five specific dimensions, offering a comprehensive assessment in a better-designed fashion. Furthermore, DSGBench also incorporates an automated decision-tracking mechanism which enables in-depth analysis of agent behaviour patterns and the turning points in their strategies. We evaluate six popular LLM agents, including open-source and closed-source models, and observe distinct strengths and limitations among various tasks. Through decision trajectory analysis, we further identify systemic limitations in different LLMs. These findings offer valuable insights for model selection and future LLM-based agent development.

2502.13451 2026-05-12 cs.RO

MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu

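AI summary This paper proposes MapNav, a new end-to-end vision-and-language navigation model that replaces the historical frames used by traditional methods with an Annotated Semantic Map (ASM), reducing storage and computational overhead. The method builds a top-down semantic map at the start of each episode, updates it at every timestep, and adds explicit textual labels for key regions, producing structured, easy-to-interpret navigation cues. Experiments show that MapNav achieves state-of-the-art performance in both simulated and real-world environments; the authors state that they will release the ASM generation code and dataset.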

Details
English abstract

Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages an Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and updates it at each timestep, allowing for precise object mapping and structured navigation information. Then, we enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generating our ASM. The MapNav agent takes the constructed ASM as input and uses the powerful end-to-end capabilities of a VLM to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can be used as a new memory representation method in VLN, paving the way for future research in this field.

2502.07553 2026-05-12 cs.LG

Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters

Yaomengxi Han, Debarghya Ghoshdastidar

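AI summary This paper studies how Transformers learn the sparse XOR function, proving that a single-layer, two-head Transformer needs only polylogarithmically many trainable parameters to identify the relevant features and drive the loss on every input to nearly zero after one gradient step. The result breaks the Ω(d) parameter bottleneck that feed-forward neural networks (FFNNs) face on this problem. Experiments further indicate that this rapid feature discovery is driven by exact softmax attention, which outperforms substitutes such as linear or component-wise attention.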

Details
English abstract

Learning sparse parity functions has become a theoretical testbed for studying feature learning in neural networks. However, existing analyses primarily focus on Feed-Forward Neural Networks (FFNNs). Meanwhile, theoretical understanding of Transformers in this setting remains limited, despite their empirical success and structural suitability for discovering sparse support over long sequences. To address this gap, we analyze how a single-layer, two-head Transformer learns the sparse XOR problem. Considering samples $(\mathbf{x}, y) \in \lbrace\pm 1\rbrace^d \times \lbrace\pm 1\rbrace$, where the label is defined by $y = -x_{i^*} x_{j^*}$ for some unknown $i^*, j^* \in [d]$, we prove that, with only $O(\mathrm{polylog}(d))$ trainable parameters, Transformers can successfully discover the relevant features and drive the loss for every input to nearly 0 with one gradient step. This result establishes that Transformers break the fundamental $\Omega(d)$ parameter bottleneck inherent to FFNNs for this problem. Furthermore, we empirically show that this rapid feature discovery is uniquely driven by the exact softmax attention, outperforming common substitutes such as linear or component-wise attention. Finally, we provide a theoretical sample complexity bound for learning from finite data, demonstrating the generalization ability of Transformers in this task.
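To make the sparse XOR setup concrete, here is a minimal data-generation sketch. It assumes inputs drawn uniformly from {-1, +1}^d and 0-based indices; the abstract does not state the input distribution, and the function names are hypothetical rather than taken from the paper.

```python
import random

def sample_sparse_xor(d, i_star, j_star, n, seed=0):
    """Draw n samples (x, y) with y = -x[i_star] * x[j_star].

    Assumption: x is uniform on {-1, +1}^d (not stated in the abstract).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.choice((-1, 1)) for _ in range(d)]
        y = -x[i_star] * x[j_star]
        samples.append((x, y))
    return samples

def single_correlation(samples, k):
    """Empirical mean of x[k] * y: how much one coordinate alone tells us."""
    return sum(x[k] * y for x, y in samples) / len(samples)

def pair_correlation(samples, a, b):
    """Empirical mean of x[a] * x[b] * y: equals -1 only for the true pair."""
    return sum(x[a] * x[b] * y for x, y in samples) / len(samples)
```

Under the uniform assumption, a single coordinate is uncorrelated with the label in expectation, while the product of the two relevant coordinates determines it exactly. This is why the learner has to discover the pair (i*, j*) rather than any individual feature.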

2502.06818 2026-05-12 cs.LG

Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Jingyun Wang, Cilin Yan, Guoliang Kang

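AI summary This paper studies how to exploit CLIP's global knowledge for training-free open-vocabulary semantic segmentation. Whereas existing methods sacrifice globality to strengthen local features, the authors rethink the global information encoded in CLIP and propose GCLIP, which reshapes the last-block attention and Value embeddings to aggregate useful global context. Experiments show that the method consistently outperforms previous state-of-the-art methods on multiple standard benchmarks.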

Comments TMM 2026

Details
English abstract

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to the dense prediction task. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features, by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With their modifications, the ability of CLIP to aggregate global context information is largely weakened. In contrast, in this paper we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into final features. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. To realize this goal, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Secondly, we aim to make Value embeddings of the last-block attention module more semantically correlated. To realize this, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.

2502.00816 2026-05-12 cs.LG

Sundial: A Family of Highly Capable Time Series Foundation Models

Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, Mingsheng Long

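AI summary This paper presents Sundial, a family of time series foundation models that operate directly on continuous-valued time series without discrete tokenization. Its flow-matching-based TimeFlow Loss allows more flexible representation learning during pre-training and lets the models generate multiple probable predictions. Pre-trained on TimeBench, a large-scale collection of real-world and synthetic data, Sundial shows strong scalability and generalization and achieves state-of-the-art results on both point and probabilistic forecasting benchmarks.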

Details
English abstract

We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next-patch's distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on continuous-valued time series without discrete tokenization. Conditioned on arbitrary-length time series, our models are pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving more flexibility in representation learning than using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with one trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse via TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which achieve unprecedented model capacity and generalization performance. In addition to excellent scalability, Sundial achieves state-of-the-art results on both point and probabilistic forecasting benchmarks with a just-in-time inference speed, i.e., making zero-shot predictions within a few milliseconds. We believe that Sundial's pioneering generative forecasting capability can improve model reliability in real-world decision-making. Code is available at: https://github.com/thuml/Sundial.
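For readers unfamiliar with flow matching, the sketch below shows the generic objective with a straight-line interpolation path. It is not Sundial's TimeFlow Loss, whose conditioning on the input series is not described in the abstract. All names are hypothetical; x0 is assumed to be a noise sample and x1 the target patch.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (target) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted velocity and the path velocity x1 - x0.

    velocity_fn(x_t, t) stands in for the network; any conditioning
    on history is omitted in this sketch.
    """
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity_fn(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
```

Because the training target is a velocity rather than the parameters of a fixed density, sampling different noise vectors x0 at inference yields different plausible predictions. This is consistent with the abstract's statement that no prior distribution has to be specified.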

2501.03544 2026-05-12 cs.CV cs.AI cs.CR

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li

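AI summary Recent text-to-image (T2I) models generate high-quality images but are easily misused to produce inappropriate content (for example sexually explicit or violent images), raising serious ethical concerns. This paper proposes PromptGuard, a soft prompt-guided content moderation technique that optimizes a universal safety soft prompt in the T2I model's textual embedding space to moderate unsafe inputs, yielding safe yet realistic images. The method does not sacrifice inference efficiency or require proxy models, and experiments on multiple datasets show that it reduces unsafe content generation and outperforms existing methods.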

Comments Accepted for publication in IEEE Transactions on Information Forensics and Security (TIFS)

Details
English abstract

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into unified safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods while outperforming eight state-of-the-art defenses. Evaluations using both a multi-head safety classifier and a VLM-based guardrail further confirm its robustness, with average unsafe ratios of 5.84% and 6.18%, respectively. Our code and dataset are available at https://t2i-promptguard.github.io/.

2412.18798 2026-05-12 cs.LG cs.AI

Ister: Linear Transformer for Efficient Multivariate Time Series Forecasting

Fanpu Cao, Shu Yang, Zhengjian Chen, Ye Liu, Laizhong Cui

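AI summary This paper proposes Ister, a linear Transformer for efficient multivariate time series forecasting. Its Dot-attention mechanism replaces conventional multi-head self-attention with linear-complexity operations, substantially improving computational efficiency. Ister also adopts an inverted seasonal-trend decomposition strategy that isolates the periodic components of a series, strengthening the model's ability to learn periodic patterns. Experiments show state-of-the-art forecasting performance on several real-world datasets.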

Comments ICASSP 2026

Details
English abstract

Transformer-based models have achieved remarkable success in multivariate time series forecasting (MTSF) by capturing long-range dependencies. However, their widespread adoption is hindered by the quadratic computational complexity of self-attention, which limits scalability on high-dimensional sequences. To address this challenge, we propose the Inverted Seasonal-Trend Decomposition Transformer (Ister), a novel architecture that enhances both predictive accuracy and computational efficiency. Central to Ister is Dot-attention, a linear-complexity attention mechanism that replaces conventional multi-head self-attention with element-wise dot-product operations to model inter-series dependencies. Furthermore, we introduce an inverted seasonal-trend decomposition strategy that isolates periodic components, enabling the model to focus learning on periodic patterns, thereby improving the performance of channel alignment. Extensive experiments across several real-world benchmarks demonstrate that Ister consistently achieves state-of-the-art performance. Code is available at https://github.com/macovaseas/Ister.

2412.13547 2026-05-12 cs.CV

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Ankit Dhiman, Tao Lu, R Srinath, Emre Arslan, Angela Xing, Yuanbo Xiangli, R Venkatesh Babu, Srinath Sridhar

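AI summary This paper proposes Turbo-GS, a method that accelerates 3D Gaussian fitting for high-quality radiance fields. It introduces a dilated rendering technique that renders only a subset of pixels and a convergence-aware budget control mechanism, reducing computational overhead and improving learning efficiency, and it combines positional and appearance errors to make densification more effective. Experiments show that Turbo-GS greatly speeds up fitting of 4K-resolution scenes while maintaining or even improving rendering quality.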

Comments Accepted to CVPR 2026. Project page: https://ivl.cs.brown.edu/research/turbo-gs

Details
English abstract

Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally, to improve densification efficiency and prevent gradient vanishing, we incorporate both positional and appearance errors to improve the effectiveness of densification. With these improvements, we achieve fast 4K-resolution fitting while maintaining, or even improving, novel view rendering quality. Extensive experiments demonstrate that our method achieves significantly faster optimization than existing approaches while preserving high rendering fidelity.
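As an illustration of rendering only a subset of pixels, the sketch below picks pixels on a strided grid whose offset cycles with the training step. The abstract does not specify the sampling pattern, so the strided grid, the cycling offset, and the function name are all assumptions made for illustration.

```python
def dilated_pixel_subset(height, width, stride, step):
    """Return the (row, col) pixels to render at a given training step.

    Hypothetical pattern: a stride x stride grid whose offset cycles
    with the step, so stride**2 consecutive steps cover every pixel once.
    """
    off_r = (step // stride) % stride
    off_c = step % stride
    return [
        (r, c)
        for r in range(off_r, height, stride)
        for c in range(off_c, width, stride)
    ]
```

With stride 2, each step touches about a quarter of the image, and four consecutive steps together visit every pixel. A real implementation would rasterize and compute the loss only at the returned coordinates.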

2412.10433 2026-05-12 cs.CV cs.LG eess.SP

Implicit Neural Compression of Point Clouds

Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Zhaoyang Zhang, Dusit Niyato

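AI summary This paper proposes NeRC$^3$, a point cloud compression framework based on implicit neural representations, targeting the efficient compression of high-precision, unstructured point cloud data. Two coordinate-based neural networks encode the geometry and the attributes of the point cloud respectively, giving an implicit representation; network parameters are quantized and compressed together with auxiliary information, and the decoder reconstructs the point cloud by feeding voxel coordinates into the networks. The authors also extend the method to dynamic point clouds as 4D-NeRC$^3$, which outperforms the latest G-PCC and V-PCC standards in geometry compression and is competitive in joint geometry and attribute compression.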

Details
Journal ref
IEEE Transactions on Image Processing, vol. 35, pp. 260-275, 2026
English abstract

Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.

2411.10298 2026-05-12 cs.CL

Topological Data Analysis Applications in Natural Language Processing: A Survey

Adaku Uchendu, Thai Le

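AI summary This survey reviews applications of Topological Data Analysis (TDA) in Natural Language Processing (NLP), examining how TDA can capture geometric and topological properties of language data to complement conventional machine learning methods. It divides existing work into theoretical approaches, which use topology to explain linguistic phenomena, and non-theoretical approaches, which integrate TDA into machine learning pipelines. The survey closes with the challenges and open questions facing the area, providing a reference for further use of TDA in NLP.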

Comments Accepted to ACM SIGKDD Explorations Journal 2026

Details
English abstract

The surge of data available on the Internet has driven the adoption of a wide range of computational methods for analyzing and extracting insights from large-scale data. Among these, Machine Learning (ML) has become a central paradigm, offering powerful tools for pattern discovery, prediction, and representation learning across many domains. At the same time, real-world data often exhibit properties such as noise, imbalance, sparsity, limited supervision, and high dimensionality, motivating the use of additional analytical perspectives that can complement standard ML pipelines. One such perspective is Topological Data Analysis (TDA), a statistical framework that focuses on the intrinsic shape and structural organization of data. Rather than replacing ML, TDA offers a complementary lens for characterizing geometric and topological properties that may be difficult to capture with conventional feature-based or purely predictive approaches. This has motivated a growing body of work that integrates TDA into ML workflows, particularly in settings where data structure plays an important role. Despite this promise, TDA has received relatively limited attention in Natural Language Processing (NLP) compared to domains with more overt structural regularities, such as computer vision. Nevertheless, a dedicated community of researchers has explored its use in NLP, leading to 137 papers that we comprehensively survey in this work. We organize these studies into theoretical and non-theoretical approaches. Theoretical approaches use topology to explain linguistic phenomena, whereas non-theoretical approaches incorporate TDA into ML-based pipelines through a variety of numerical representations. We conclude by discussing the key challenges and open questions that continue to shape this emerging area. Resources and a list of papers are available at: https://github.com/AdaUchendu/AwesomeTDA4NLP.

2411.08443 2026-05-12 cs.LG cs.CV

Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA

Laiqiao Qin, Tianqing Zhu, Linlin Wang, Wanlei Zhou

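AI summary This paper studies machine unlearning for pre-trained models: removing a subset of the training data from a trained model without significantly affecting performance on the remaining data. To address the high computational cost of conventional fine-tuning and its tendency to shift intermediate features, the authors propose Residual Feature Alignment Unlearning, an efficient method that uses LoRA to decompose intermediate features into pre-trained and residual features and adjusts the residual features to meet both the unlearning and retention targets. Experiments on multiple datasets indicate that the method is effective.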

Comments v2: corrected a sign typo in Algorithm 1 line 13

Details
Journal ref
IEEE Transactions on Dependable and Secure Computing, 2026
English abstract

Machine unlearning is an emerging technology that removes a subset of the training data from a trained model without significantly affecting the model performance on the remaining data. This topic is becoming increasingly important in protecting user privacy and eliminating harmful or outdated data. The key challenge lies in effectively and efficiently unlearning specific information without compromising the model's utility on the retained data. For pre-trained models, fine-tuning is an important way to achieve the unlearning target. Previous work typically fine-tuned the entire model's parameters, which incurred significant computational costs. In addition, the fine-tuning process may cause shifts in the intermediate layer features, affecting the model's overall utility. In this work, we propose a novel and efficient machine unlearning method for pre-trained models. We term the method Residual Feature Alignment Unlearning. Specifically, we leverage LoRA (Low-Rank Adaptation) to decompose the model's intermediate features into pre-trained features and residual features. By adjusting the residual features, we align the unlearned model with the pre-trained model at the intermediate feature level to achieve both unlearning and remaining targets. The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set. Extensive experiments on numerous datasets validate the effectiveness of our approach.
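The feature decomposition described above can be sketched for a single linear layer with a LoRA adapter. The sketch assumes the usual LoRA form h = Wx + B(Ax) with the scaling factor omitted, and it does not reproduce the paper's loss, which the abstract does not give. Function names are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_features(W, A, B, x):
    """Split a LoRA layer's output into pre-trained and residual parts.

    W: frozen pre-trained weight (out x in)
    A: LoRA down-projection (rank x in)
    B: LoRA up-projection (out x rank)
    """
    pretrained = matvec(W, x)           # feature from the frozen weights
    residual = matvec(B, matvec(A, x))  # feature contributed by the adapter
    combined = [p + r for p, r in zip(pretrained, residual)]
    return pretrained, residual, combined

def residual_energy(residual):
    """Squared norm of the residual; zero means the layer matches the pre-trained one."""
    return sum(r * r for r in residual)
```

In this picture, the abstract's goal of "zero residuals on the retained set" corresponds to driving `residual_energy` toward zero for retained inputs, while inputs from the unlearning set are pushed toward a shifted, nonzero residual.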