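arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.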

2507.04465 2026-05-12 cs.CV

Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions

Konstantinos Foteinos, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

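AI summary: This paper surveys progress in deep learning for visual hand gesture recognition (VHGR), systematically reviewing the main methods, commonly used datasets, and evaluation metrics, with the aim of giving researchers a comprehensive reference guide. The survey is organized around four core questions about VHGR: it analyzes the current state-of-the-art methods, compares performance across tasks, and identifies the field's main challenges and future research directions.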

Comments Submitted to Neurocomputing. Rewritten abstract, due to limited space

Abstract

The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always-important field of visual hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the current state-of-the-art (SOTA). The current survey aims to fill this gap by presenting a comprehensive overview of this computer vision field. With a systematic research methodology and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to propose improvements. Specifically, this survey focuses on four fundamental questions: what are the main VHGR aspects, what are the current SOTA methods, what comparative insights can be drawn across methods and tasks, and which challenges shape future research. Starting with the methodology used to locate the related literature, the survey identifies and organizes the key VHGR approaches in a taxonomy-based format. The SOTA methods are grouped across three primary VHGR tasks: static, isolated dynamic and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. To support the experimental evaluation of future methods in the field, the study reviews commonly used datasets and presents the standard performance metrics. Our survey concludes by identifying the major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

2507.04277 2026-05-12 cs.CV

Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices

Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong

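AI summary: This paper proposes LiteIE, a lightweight low-light image enhancement framework designed for real-time, efficient enhancement on mobile devices. The method needs no large-scale annotated data; it uses a backbone-agnostic feature extractor with only two convolutional layers and a parameter-free iterative restoration module, which substantially reduces computation and parameter count. Experiments show that LiteIE achieves a PSNR on the LOL dataset 1.4 dB higher than existing methods while using only 0.07% of the parameters of comparable methods, and that it processes 4K images at 30 frames per second on a mobile processor, making it suitable for deployment on resource-constrained edge devices.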

Comments Accepted by ESWA

2506.21095 2026-05-12 cs.LG cs.AI

FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale

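AI summary: Federated learning (FL) enables collaborative training while preserving privacy, but it introduces an "illusion of fairness": a global model can look fair on average at the server while discrimination persists at the client level. Existing fairness-enhancing FL methods usually mitigate bias for a single sensitive attribute, overlooking two realistic and conflicting settings, attribute bias and value bias. To address this, the paper proposes FeDa4Fair, the first benchmarking framework for testing fairness methods under heterogeneous client bias. It comprises a library for generating tailored datasets, a standardized evaluation suite, and fairness evaluation functions, supporting more robust and reproducible research on fairness in federated learning.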

Comments Accepted at ACM FAccT 2026

2506.12542 2026-05-12 cs.LG cs.AI cs.CV stat.ML

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

Ejafa Bassam, Dawei Zhu, Kaigui Bian

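AI summary: This paper proposes PLD, a choice-theoretic knowledge distillation method that interprets the teacher network's logits as "worth" scores for the classes and builds a weighted list-wise ranking loss within the Plackett-Luce model. PLD directly optimizes the teacher's full ranking structure, placing the true label first and ordering the remaining classes by descending teacher confidence, which yields a convex, translation-invariant surrogate loss. Experiments show that PLD delivers consistent gains across multiple datasets and teacher-student pairs of different architectures, and that it is applicable to a variety of distillation objectives.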
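
To make the ranking loss in this summary concrete, here is a minimal sketch of a Plackett-Luce negative log-likelihood for the target ordering it describes (true label first, remaining classes by descending teacher logit). The function name, the plain-list inputs, and the uniform default position weights are assumptions for illustration; the listing does not give the paper's actual weighting scheme, so this should not be read as the authors' implementation.

```python
import math


def plackett_luce_distill_loss(student_logits, teacher_logits, true_label, weights=None):
    """Plackett-Luce negative log-likelihood of a teacher-derived ranking.

    The target ranking puts the true label first and orders the remaining
    classes by descending teacher logit. `weights` are per-position weights
    (hypothetical; uniform if omitted).
    """
    classes = range(len(student_logits))
    rest = sorted((c for c in classes if c != true_label),
                  key=lambda c: teacher_logits[c], reverse=True)
    ranking = [true_label] + rest
    if weights is None:
        weights = [1.0] * len(ranking)  # assumption: uniform position weights

    loss = 0.0
    for k, chosen in enumerate(ranking):
        # Classes still "available" at position k of the ranking.
        remaining = [student_logits[c] for c in ranking[k:]]
        m = max(remaining)
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))  # stable logsumexp
        loss += weights[k] * (log_z - student_logits[chosen])
    return loss


teacher = [1.0, 3.0, 0.2]
aligned = plackett_luce_distill_loss([2.0, 0.5, -1.0], teacher, true_label=0)
reversed_order = plackett_luce_distill_loss([-1.0, 0.5, 2.0], teacher, true_label=0)
shifted = plackett_luce_distill_loss([7.0, 5.5, 4.0], teacher, true_label=0)
```

Each term is a log-sum-exp minus a linear term in the student logits, which is why the loss is convex in those logits for non-negative weights, and adding a constant to every student logit cancels out of each term, which gives the translation invariance. In the example, `aligned` is smaller than `reversed_order`, and `shifted` equals `aligned` up to floating-point error.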

2506.07436 2026-05-12 cs.CV cs.AI cs.ET

Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

Nishi Chaudhary, S M Jamil Uddin, Sathvik Sharath Chandra, Anto Ovid, Alex Albert

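AI summary: This paper compares five state-of-the-art multimodal large language models on hazard recognition at construction sites and examines how different prompting strategies affect performance. Using zero-shot, few-shot, and chain-of-thought (CoT) prompting, the study finds that CoT prompting significantly improves recognition accuracy and that performance varies across models and conditions, with GPT-4.5 and GPT-o3 performing best in most settings. The study underlines the key role of prompt design in improving multimodal LLMs for safety applications and offers practical guidance for building more reliable AI-assisted safety systems.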

DOI: 10.1109/ACCESS.2026.3691685

Abstract

The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.
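
The precision, recall, and F1-score metrics named in this abstract can be sketched for a single image as a set comparison between the hazards a model reports and the hazards annotated as present. The function name and the hazard labels below are hypothetical, and the abstract does not say how the study matched model output to ground truth, so exact label matching is an assumption.

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 for one image's hazard labels."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels, for illustration only.
model_output = {"fall from height", "electrical", "struck-by"}
ground_truth = {"fall from height", "electrical", "trip hazard", "chemical"}
precision, recall, f1 = precision_recall_f1(model_output, ground_truth)
```

Here two of the three reported hazards are correct (precision 2/3) and two of the four annotated hazards are found (recall 1/2), giving an F1 of 4/7.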

2506.01352 2026-05-12 cs.LG

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kai Chen, Kun Yuan, Binhang Yuan

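AI summary: In distributed training of large language models, network communication is a bottleneck that limits training efficiency under pipeline parallelism. This paper proposes TAH-Quant, an activation quantization framework that uses tile-wise adaptive quantization based on the Hadamard transform to cut the communication cost of intermediate activations while preserving training convergence. Experiments show that TAH-Quant achieves 3-4 bit activation quantization without degrading model performance and significantly improves throughput and training speed over conventional approaches.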
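
As a rough illustration of why quantizing activations tile by tile helps at 3-4 bits, the sketch below applies plain min-max uniform quantization independently to each tile of a flat activation list. It deliberately omits any transform step and any adaptive bit allocation, and all names and the tile size are assumptions; it is not the TAH-Quant algorithm itself.

```python
def quantize_tile(tile, bits=4):
    """Uniform min-max quantization of one tile to integer codes."""
    levels = (1 << bits) - 1
    lo, hi = min(tile), max(tile)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in tile]
    return codes, lo, scale


def dequantize_tile(codes, lo, scale):
    """Map integer codes back to approximate activation values."""
    return [lo + c * scale for c in codes]


def quantize_activations(values, tile_size=4, bits=4):
    """Split a flat activation list into tiles and quantize each separately."""
    tiles = [values[i:i + tile_size] for i in range(0, len(values), tile_size)]
    return [quantize_tile(tile, bits) for tile in tiles]


# Second tile holds much larger values than the first.
activations = [0.1, -0.4, 0.25, 0.9, 12.0, 11.5, 12.2, 11.9]
packed = quantize_activations(activations, tile_size=4, bits=4)
restored = [x for codes, lo, scale in packed for x in dequantize_tile(codes, lo, scale)]
```

Because each tile carries its own offset and scale, the large values in the second tile do not widen the quantization step of the first tile; the per-value error stays within half a step of the tile it belongs to.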

2506.01301 2026-05-12 cs.AI cs.CL

Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Dae Houlihan, Soroush Vosoughi, Shao-Yuan Lo

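AI summary: This study addresses multi-step complexity in multimodal theory-of-mind (ToM) reasoning by proposing a scalable Bayesian planner that decomposes ToM reasoning into stepwise Bayesian updates. A weak-to-strong control mechanism lets small language models specialize in ToM-specific likelihood estimation and transfers their reasoning behavior to larger language models, effectively integrating social and world knowledge. Experiments show a 4.6% accuracy improvement over existing techniques on multimodal ToM benchmarks, with particularly strong results in complex and unseen scenarios.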

Comments Accepted as a Spotlight at the 2025 Forty-Second International Conference on Machine Learning (ICML 2025)

2505.16741 2026-05-12 cs.LG math.OC stat.ML

Meta-reinforcement learning with minimum attention

Shashank Gupta, Pilhwa Lee

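AI summary: This paper applies the minimum attention principle to reinforcement learning, adding a minimum-attention regularizer to the reward function to improve an agent's learning efficiency and stability in high-dimensional nonlinear dynamical environments. Within a model-based meta-learning framework that alternates between model learning and meta-policy optimization, experiments show that the method outperforms state-of-the-art algorithms in few-shot adaptation and in robustness to model and environment perturbations, and that it also improves energy efficiency.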

Comments 30 pages, 22 figures

2505.16025 2026-05-12 cs.CV cs.MM eess.IV

Context and Pixel Aware Large Language Model for Video Quality Assessment

Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang

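AI summary: Video quality assessment (VQA) is a challenging research topic with broad applications. To address the limits of traditional models in contextual understanding and pixel-level distortion perception, as well as the limited sensitivity of recent multimodal large language models and their tendency to treat tasks separately, this paper proposes CP-LLM, a context- and pixel-aware large language model. It uses a dual vision encoder architecture that analyzes high-level video semantics and low-level pixel distortions separately, then fuses the two through a language decoder for reasoning, producing robust quality scores and interpretable descriptions. Experiments show strong results on multiple benchmarks, with a marked improvement in sensitivity to pixel distortions.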

Comments Accepted to ICIP 2026

2505.15879 2026-05-12 cs.CV cs.AI cs.CL

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang

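AI summary: This paper proposes GRIT, a method for training multimodal large language models (MLLMs) to reason with images in vision-language tasks. GRIT introduces a grounded reasoning paradigm over images and text, in which the model interleaves natural language with explicit bounding-box coordinates as it generates a reasoning chain, so that visual information is integrated explicitly. Combined with the reinforcement learning algorithm GRPO-GR, GRIT trains efficiently without annotated reasoning chains or bounding-box labels, and needs only a small amount of data to substantially improve the model's ability to produce coherent, visually grounded reasoning chains.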

2505.12437 2026-05-12 cs.LG cs.AI

A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring

Michele Fontanesi, Alessio Micheli, Marco Podda, Domenico Tortorella

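AI summary: This paper proposes a method for systematically generating graph explainability (graph XAI) benchmarks, motivated by the opacity of the decision process of graph neural networks (GNNs). It uses the Weisfeiler-Leman color refinement algorithm to build benchmarks automatically from generic graph classification datasets, mining discriminative subgraph motifs that serve as proxy ground-truth explanations while ensuring that these motifs can be learned by GNNs. The method yields the OpenGraphXAI benchmark suite of 15 datasets, released with code for generating more than 2,000 additional benchmarks, giving graph explainers a more comprehensive and reproducible evaluation platform.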

DOI: 10.1007/s10618-026-01212-z
Journal ref: Data Mining and Knowledge Discovery, vol. 40(4), article no. 42 (2026)

Abstract

Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
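
The Weisfeiler-Leman color refinement that this abstract builds on can be sketched in a few lines: every node repeatedly receives a new color determined by its current color and the multiset of its neighbors' colors. The code below is the textbook 1-WL procedure on unlabeled graphs given as adjacency dictionaries; the function names are assumptions, and the paper's motif mining on top of the refinement is not shown.

```python
from collections import Counter


def wl_colors(adjacency, rounds=3):
    """1-WL color refinement; `adjacency` maps each node to its neighbor list."""
    colors = {v: 0 for v in adjacency}  # unlabeled graph: one initial color
    for _ in range(rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        refined = {v: palette[signatures[v]] for v in adjacency}
        if len(set(refined.values())) == len(set(colors.values())):
            break  # no class was split, so the partition is stable
        colors = refined
    return colors


def wl_distinguishes(graph_a, graph_b, rounds=3):
    """True if 1-WL color histograms differ (graphs are then non-isomorphic)."""
    # Refine on the disjoint union so that color ids are shared by both graphs.
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in graph_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in graph_b.items()})
    colors = wl_colors(union, rounds)
    hist_a = Counter(colors[("a", v)] for v in graph_a)
    hist_b = Counter(colors[("b", v)] for v in graph_b)
    return hist_a != hist_b


triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

The triangle and the three-node path receive different color histograms, while the hexagon and two disjoint triangles, both 2-regular, do not. That second pair is the standard example of the limit of WL expressiveness that the abstract refers to when it says motifs should be learnable by GNNs.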

2505.10872 2026-05-12 cs.RO cs.AI cs.CL

REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

Chenxi Jiang, Chuhao Zhou, Jianfei Yang

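AI summary: This paper introduces REI-Bench, a benchmark for evaluating how well embodied agents plan tasks from human instructions that contain vague referring expressions (REs). The study finds that such vagueness significantly lowers the success rate of robot task planning, by up to 36.9%. To address this, the authors propose a task-oriented context cognition approach that generates clear instructions and effectively improves planning performance, making robots easier to use for non-expert users such as older adults and children.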

Comments Accepted at ICLR 2026

2505.02184 2026-05-12 cs.AI cs.DC cs.PL cs.SE

Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

Matthew T. Dearing, Yiheng Tao, Xingfu Wu, Zhiling Lan, Valerie Taylor

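AI summary: This paper studies how large language models (LLMs) can automatically generate energy-efficient parallel scientific code under feedback guidance. It proposes LASSI-EE, an automated refactoring framework that combines runtime power profiling, energy-aware prompting, self-correcting feedback, and an LLM-as-a-judge strategy to optimize code iteratively. Experiments show average energy reductions of 36% and 34% on two GPU platforms, demonstrating the method's effectiveness at improving code energy efficiency.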

Comments 12 pages, 5 figures, version under review at a peer-reviewed conference

2504.14697 2026-05-12 cs.LG math.AP math.DS stat.ML

Quantitative Clustering in Mean-Field Transformer Models

Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

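AI summary: This paper studies the long-time clustering behavior of tokens in mean-field transformer models, showing that under suitable parameter assumptions the model converges exponentially fast to a Dirac point mass. The authors' quantitative analysis gives explicit convergence rates, providing a theoretical basis for understanding synchronization in transformer models.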

Comments 50 pages, 4 figures; We have updated the introduction and added sketches of the proofs of the main theorems
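
The clustering behavior described in this entry can be observed numerically in a toy setting. The sketch below simulates a finite number of tokens on the unit sphere, each moving toward the softmax-weighted average of all tokens, with an explicit Euler step followed by renormalization. This finite-particle model, the identity query/key/value maps, the inverse temperature `beta`, and the initialization in the positive orthant are all assumptions chosen for illustration; the listing does not specify the paper's exact model or parameter assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def attention_step(tokens, beta=1.0, dt=0.1):
    """One Euler step of softmax self-attention dynamics on the sphere."""
    updated = []
    for x in tokens:
        weights = [math.exp(beta * dot(x, y)) for y in tokens]
        total = sum(weights)
        # Attention output: softmax-weighted average of all tokens.
        avg = [sum(w * y[k] for w, y in zip(weights, tokens)) / total
               for k in range(len(x))]
        # Keep only the component tangent to the sphere at x, then step.
        radial = dot(avg, x)
        moved = [x[k] + dt * (avg[k] - radial * x[k]) for k in range(len(x))]
        updated.append(normalize(moved))
    return updated


def min_pairwise_similarity(tokens):
    return min(dot(a, b) for i, a in enumerate(tokens) for b in tokens[i + 1:])


def simulate(n_tokens=8, dim=3, steps=500, seed=0):
    rng = random.Random(seed)
    # Assumption: start in the positive orthant so all tokens share a hemisphere.
    tokens = [normalize([abs(rng.gauss(0, 1)) + 1e-3 for _ in range(dim)])
              for _ in range(n_tokens)]
    start = min_pairwise_similarity(tokens)
    for _ in range(steps):
        tokens = attention_step(tokens)
    return start, min_pairwise_similarity(tokens)


start_similarity, end_similarity = simulate()
```

The smallest pairwise inner product rises toward 1 as the tokens collapse onto a single point, which is the finite-token analogue of convergence to a Dirac mass.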

2504.14044 2026-05-12 cs.AI cs.CR

Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy

Regan Bolton, Mohammadreza Sheikhfathollahi, Simon Parkinson, Dan Basher, Howard Parkinson

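AI summary: This paper studies how large language models (LLMs) and multi-stage retrieval can make operational technology cybersecurity (OTCS) compliance verification more efficient and accurate for critical infrastructure such as railways. It proposes a compliance architecture based on multi-stage retrieval that supplies additional context from regulatory standards, significantly improving the correctness of compliance judgments and the quality of the reasoning. Experiments show clear advantages over baseline methods on cybersecurity standards such as IEC 62443 and IEC 63452, offering an effective compliance assessment tool for industries short of cybersecurity expertise.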

2504.12501 2026-05-12 cs.LG

Reinforcement Learning from Human Feedback

Nathan Lambert

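AI summary: The book Reinforcement Learning from Human Feedback gives a systematic introduction to the core methods of reinforcement learning from human feedback (RLHF), aiming to offer readers with a quantitative background a gentle yet comprehensive guide. It starts from the origins of RLHF, covers problem formulation, data collection, and the mathematical foundations, and then details the key optimization stages, from instruction tuning to reward model training, rejection sampling, reinforcement learning, and direct alignment algorithms. It closes with frontier topics that remain understudied, such as synthetic data and evaluation, and poses open questions for the field.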

Comments 229 pages. Web-native version at https://rlhfbook.com/ Continually improving, latest version at website

2503.18273 2026-05-12 cs.LG

Decoding Islamophobic Discourse: Using LLMs to Identify Tropes and Semi-Coded Hate Speech

Raza Ul Mustafa, Roi Dupart, Gabrielle Smith, Noman Ashraf, Nathalie Japkowicz

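AI summary: This paper studies Islamophobic discourse, a growing problem in Western societies, focusing on semi-coded terms used on extremist social platforms (such as muzrat and pislam) to identify the hate speech they imply. Using large language models (LLMs) and BERT topic modeling, the study reveals the hateful character of these terms in specific contexts and finds that Islamophobic content scores higher on toxicity than other kinds of hate speech. It also shows that although LLMs can understand these out-of-vocabulary slurs, current moderation strategies and algorithmic detection need further improvement to counter the spread of such discourse more effectively.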

2503.12333 2026-05-12 cs.RO cs.MA

GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Vagul Mahadevan, Shangtong Zhang, Rohan Chandra

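AI summary: Safe, agile, and socially compliant multi-agent navigation in crowded, constrained environments remains a major challenge, especially in decentralized settings where agents have different, unknown priorities and there is no central coordinator. The study proposes GameChat, a method in which agents communicate autonomously in natural language to resolve conflicts and navigate efficiently. Experiments show that it significantly improves navigation efficiency and the completion rate of priority tasks across a range of scenarios, demonstrating its effectiveness and scalability in multi-agent systems.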

2503.09158 2026-05-12 cs.CV

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Fufangchen Zhao, Songbai Tan, Xuerui Qiu, Linrui Xun, Wenhao Jiang, Jinkai Zheng, Hehe Fan, Jian Gao, Danfeng Yan, Ming Li

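AI summary: Existing video large language models often fail to capture the subtle, question-relevant facial cues that facial video understanding requires. This paper proposes FaVChat, a video understanding model guided by hierarchical prompts that emphasizes question-relevant information at three complementary levels, improving reasoning over fine-grained facial features. The study also introduces a data-efficient GRPO reinforcement learning strategy to improve performance when data is scarce, and builds the FaVChat 170K benchmark dataset, containing 60,000 high-quality facial videos and 170,000 question-answer pairs. Experiments show that the method outperforms existing models on multiple facial understanding tasks.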

2503.06047 2026-05-12 cs.AI cs.CL

DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Liquan Xiao

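AI summary: DSGBench is a diverse strategic game benchmark for evaluating LLM-based agents in complex decision-making environments. It includes six complex strategic games that demand long-horizon, multi-dimensional decision-making, and supports customizing tasks by difficulty and objective. DSGBench uses a fine-grained scoring system that assesses decision-making ability along five specific dimensions, together with an automated decision-tracking mechanism that analyzes agents' behavior patterns and strategy shifts in depth, providing useful guidance for model selection and future agent development.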

Comments 43 pages, 5 figures, conference

2502.13451 2026-05-12 cs.RO

MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu

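AI summary: This paper proposes MapNav, a new end-to-end vision-and-language navigation model that replaces the historical frames used by conventional methods with an annotated semantic map (ASM), reducing storage and computation costs. The method builds a top-down semantic map at the start of each episode and updates it at every step, adding explicit text labels to enrich navigation information and produce structured, easy-to-interpret navigation cues. Experiments show that MapNav achieves state-of-the-art performance in both simulated and real-world environments; the ASM generation code and dataset are open-sourced as a resource for future research.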

2502.07553 2026-05-12 cs.LG

Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters

Yaomengxi Han, Debarghya Ghoshdastidar

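AI summary: This paper studies the ability of transformers to learn sparse XOR functions, proving that a single-layer, two-head transformer needs only polylogarithmically many parameters to identify the relevant features and to drive the loss on all inputs close to zero after one gradient update. The result breaks the linear-parameter bottleneck that conventional feed-forward neural networks (FFNNs) face on this problem. Experiments further indicate that the transformer's fast feature discovery stems from its exact softmax attention, which outperforms alternatives such as linear or component-wise attention.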

2502.06818 2026-05-12 cs.LG

Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Jingyun Wang, Cilin Yan, Guoliang Kang

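AI summary: This paper studies how to exploit CLIP's global knowledge for open-vocabulary semantic segmentation without any training. Unlike existing methods that sacrifice globality to strengthen local features, the authors rethink the global information encoded in CLIP and propose GCLIP, which reshapes the last layer's attention and value embeddings to integrate useful global context. Experiments show that the method significantly outperforms existing state-of-the-art methods on multiple benchmark datasets.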

Comments TMM 2026

2502.00816 2026-05-12 cs.LG

Sundial: A Family of Highly Capable Time Series Foundation Models

Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, Mingsheng Long

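AI summary: This paper presents Sundial, a family of time series foundation models that operate directly on continuous-valued time series without discrete tokenization. With TimeFlow Loss, a flow-matching-based objective, the models learn more flexible representations during pre-training and can generate multiple plausible predictions. Sundial is pre-trained on TimeBench, a large-scale collection of real-world and synthetic data, and shows strong scalability and generalization, achieving state-of-the-art performance on both point and probabilistic forecasting benchmarks.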

2501.03544 2026-05-12 cs.CV cs.AI cs.CR

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li

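AI summary: Text-to-image (T2I) models have recently excelled at generating high-quality images, but they can easily be misused to generate inappropriate content such as pornography and violence, raising serious ethical concerns. This paper proposes PromptGuard, a content moderation technique guided by soft prompts: it optimizes a universal safety soft prompt in the text embedding space of the T2I model to suppress unsafe inputs effectively, so that the model generates safe yet realistic images. The method does not sacrifice inference efficiency or require a proxy model, and experiments show that it reduces unsafe content generation across multiple datasets and outperforms existing methods.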

Comments Accepted for publication in IEEE Transactions on Information Forensics and Security (TIFS)

2412.18798 2026-05-12 cs.LG cs.AI

Ister: Linear Transformer for Efficient Multivariate Time Series Forecasting

Fanpu Cao, Shu Yang, Zhengjian Chen, Ye Liu, Laizhong Cui

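AI summary: This paper proposes Ister, a linear transformer for efficient multivariate time series forecasting. The model introduces a dot-product attention mechanism that replaces conventional multi-head self-attention with a linear-complexity operation, significantly improving computational efficiency. Ister also uses an inverted seasonal-trend decomposition strategy to separate out the periodic components of a time series, strengthening the model's ability to learn periodic patterns. Experiments show that Ister achieves state-of-the-art forecasting performance on multiple real-world datasets.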

Comments ICASSP 2026

2412.13547 2026-05-12 cs.CV

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Ankit Dhiman, Tao Lu, R Srinath, Emre Arslan, Angela Xing, Yuanbo Xiangli, R Venkatesh Babu, Srinath Sridhar

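AI summary: This paper proposes Turbo-GS, a method that accelerates 3D Gaussian fitting to make producing high-quality radiance fields more efficient. It introduces sparse rendering and a convergence-aware budget control mechanism that significantly reduce computational cost and improve learning efficiency, and it combines position and appearance errors to improve densification. Experiments show that Turbo-GS greatly speeds up fitting of 4K-resolution scenes while maintaining or even improving rendering quality.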

Comments Accepted to CVPR 2026. Project page: https://ivl.cs.brown.edu/research/turbo-gs

2412.10433 2026-05-12 cs.CV cs.LG eess.SP

Implicit Neural Compression of Point Clouds

Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Zhaoyang Zhang, Dusit Niyato

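AI summary: This paper proposes NeRC$^3$, a point cloud compression framework based on implicit neural representations, targeting the difficulty of efficiently compressing high-precision, unstructured point cloud data. The method uses two coordinate-based neural networks to encode the geometry and attributes of the point cloud respectively, giving an implicit representation; it stores the result compactly by quantizing the network parameters and compressing auxiliary information, and the decoder reconstructs the original point cloud by feeding coordinates into the networks. The authors also extend the method to dynamic point clouds with 4D-NeRC$^3$, which achieves better geometry compression than the latest G-PCC and V-PCC standards and is competitive in joint geometry and attribute compression.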

DOI: 10.1109/TIP.2025.3648141
Journal ref: IEEE Transactions on Image Processing, vol. 35, pp. 260-275, 2026

Abstract

Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.

2411.10298 2026-05-12 cs.CL

Topological Data Analysis Applications in Natural Language Processing: A Survey

Adaku Uchendu, Thai Le

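AI summary: This paper surveys applications of topological data analysis (TDA) in natural language processing (NLP), examining how TDA can capture geometric and topological properties of language data to complement conventional machine learning methods. It divides existing work into theoretical and non-theoretical approaches: the former use topology to explain linguistic phenomena, while the latter incorporate TDA into machine learning pipelines. The survey summarizes the field's challenges and future research directions as a reference for further use of TDA in NLP.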

Comments Accepted to ACM SIGKDD Explorations Journal 2026

Abstract

The surge of data available on the Internet has driven the adoption of a wide range of computational methods for analyzing and extracting insights from large-scale data. Among these, Machine Learning (ML) has become a central paradigm, offering powerful tools for pattern discovery, prediction, and representation learning across many domains. At the same time, real-world data often exhibit properties such as noise, imbalance, sparsity, limited supervision, and high dimensionality, motivating the use of additional analytical perspectives that can complement standard ML pipelines. One such perspective is Topological Data Analysis (TDA), a statistical framework that focuses on the intrinsic shape and structural organization of data. Rather than replacing ML, TDA offers a complementary lens for characterizing geometric and topological properties that may be difficult to capture with conventional feature-based or purely predictive approaches. This has motivated a growing body of work that integrates TDA into ML workflows, particularly in settings where data structure plays an important role. Despite this promise, TDA has received relatively limited attention in Natural Language Processing (NLP) compared to domains with more overt structural regularities, such as computer vision. Nevertheless, a dedicated community of researchers has explored its use in NLP, leading to 137 papers that we comprehensively survey in this work. We organize these studies into theoretical and non-theoretical approaches. Theoretical approaches use topology to explain linguistic phenomena, whereas non-theoretical approaches incorporate TDA into ML-based pipelines through a variety of numerical representations. We conclude by discussing the key challenges and open questions that continue to shape this emerging area. Resources and a list of papers are available at: https://github.com/AdaUchendu/AwesomeTDA4NLP.

2411.08443 2026-05-12 cs.LG cs.CV

Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA

Laiqiao Qin, Tianqing Zhu, Linlin Wang, Wanlei Zhou

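AI summary: This paper studies machine unlearning for pre-trained models: removing part of the training data from a trained model without significantly degrading performance on the remaining data. To avoid the high compute cost of conventional fine-tuning and its tendency to disrupt intermediate features, the authors propose Residual Feature Alignment Unlearning, an efficient method that uses LoRA to decompose intermediate features and adjusts the residual features to align them appropriately on the forgotten and retained data. Experiments show that the method works well on multiple datasets.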

Comments v2: corrected a sign typo in Algorithm 1 line 13