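arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.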

2506.03362 2026-03-09 cs.RO

Robustness-Aware Tool Selection and Manipulation Planning with Learned Energy-Informed Guidance

Yifei Dong, Yan Zhang, Sylvain Calinon, Florian T. Pokorny

Comments: IEEE International Conference on Robotics and Automation (ICRA), 2026

Abstract

Humans subconsciously choose robust ways of selecting and using tools, for example, choosing a ladle over a flat spatula to serve meatballs. However, robustness under external disturbances remains underexplored in robotic tool-use planning. This paper presents a robustness-aware method that jointly selects tools and plans contact-rich manipulation trajectories, explicitly optimizing for robustness against disturbances. At the core of our method is an energy-based robustness metric that guides the planner toward robust manipulation behaviors. We formulate a hierarchical optimization pipeline that first identifies a tool and configuration that optimizes robustness, and then plans a corresponding manipulation trajectory that maintains robustness throughout execution. We evaluate our method across three representative tool-use tasks. Simulation and real-world results demonstrate that our method consistently selects robust tools and generates disturbance-resilient manipulation plans.

2506.01646 2026-03-09 cs.CL cs.AI cs.LG

ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge

Chaoyue He, Xin Zhou, Yi Wu, Xinjia Yu, Yan Zhang, Lei Zhang, Di Wang, Shengfei Lyu, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao

Comments: EMNLP'25 Main Oral (42 pages, 10 figures, 11 tables), Nominations for Resource Award & Theme Paper Award

DOI: 10.18653/v1/2025.emnlp-main.739
Journal ref: In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), pages 14612-14653

Abstract

We introduce ESGenius, a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social, and Governance (ESG) and sustainability-focused question answering. ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1,136 Multiple-Choice Questions (MCQs) generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics. Each question is systematically linked to its corresponding source text, enabling transparent evaluation and supporting Retrieval-Augmented Generation (RAG) methods; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports, and recommendation documents from 7 authoritative sources. Moreover, to fully assess the capabilities and adaptation potential of LLMs, we implement a rigorous two-stage evaluation protocol: Zero-Shot and RAG. Extensive experiments across 50 LLMs (0.5B to 671B) demonstrate that state-of-the-art models achieve only moderate performance in zero-shot settings, with accuracies around 55-70%, highlighting a significant knowledge gap for LLMs in this specialized, interdisciplinary domain. However, models employing RAG demonstrate significant performance improvements, particularly for smaller models. For example, DeepSeek-R1-Distill-Qwen-14B improves from 63.82% (zero-shot) to 80.46% with RAG. These results demonstrate the necessity of grounding responses in authoritative sources for enhanced ESG understanding. To the best of our knowledge, ESGenius is the first comprehensive QA benchmark designed to rigorously evaluate LLMs on ESG and sustainability knowledge, providing a critical tool to advance trustworthy AI in this vital domain.

2505.21099 2026-03-09 cs.CV

Instance Data Condensation for Image Super-Resolution

Tianhao Peng, Ho Man Kwan, Yuxuan Jiang, Ge Gao, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull

2505.19297 2026-03-09 cs.CV

Alchemist: Turning Public Text-to-Image Data into Generative Gold

Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin

Comments: Accepted to the Datasets and Benchmarks Track of the 39th Conference on Neural Information Processing Systems

2505.18663 2026-03-09 cs.CV

DVD-Quant: Data-free Video Diffusion Transformers Quantization

Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang

Comments: Code and models will be available at https://github.com/lhxcs/DVD-Quant

2505.13782 2026-03-09 cs.RO cs.SY eess.SY

C*: A Coverage Path Planning Algorithm for Unknown Environments using Rapidly Covering Graphs

Zongyuan Shen, James P. Wilson, Shalabh Gupta

DOI: 10.1109/TRO.2026.3661719
Journal ref: IEEE Transactions on Robotics, Vol 42, pg. 1233-1253, 2026

Abstract

The paper presents a novel sample-based algorithm, called C*, for real-time coverage path planning (CPP) of unknown environments. C* is built upon the concept of a Rapidly Covering Graph (RCG), which is incrementally constructed during robot navigation via progressive sampling of the search space. By using efficient sampling and pruning techniques, the RCG is constructed to be a minimum-sufficient graph, where its nodes and edges form the potential waypoints and segments of the coverage trajectory, respectively. The RCG tracks the coverage progress, generates the coverage trajectory, and helps the robot escape from dead-end situations. To minimize coverage time, C* produces the desired back-and-forth coverage pattern, while adapting to the TSP-based optimal coverage of local isolated regions, called coverage holes, which are surrounded by obstacles and covered regions. It is analytically proven that C* provides complete coverage of unknown environments. The algorithmic simplicity and low computational complexity of C* make it easy to implement and suitable for real-time on-board applications. The performance of C* is validated by 1) extensive high-fidelity simulations and 2) laboratory experiments using an autonomous robot. C* yields near-optimal trajectories, and a comparative evaluation with seven existing CPP methods demonstrates significant improvements in performance in terms of coverage time, number of turns, trajectory length, and overlap ratio, while preventing the formation of coverage holes. Finally, C* is comparatively evaluated on two different CPP applications using 1) energy-constrained robots and 2) multi-robot teams.

2505.11165 2026-03-09 cs.LG cs.AI cs.CL cs.CV

Maximizing Asynchronicity in Event-based Neural Networks

Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza, Wenhui Wang

Comments: 22 pages, 7 figures, 15 tables, ICLR 2026 Camera Ready paper

2505.02387 2026-03-09 cs.CL cs.AI cs.LG

RM-R1: Reward Modeling as Reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji

Comments: ICLR 2026

2504.20408 2026-03-09 cs.LG cs.AI cs.NA math.NA physics.comp-ph

FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation

Jae Yong Lee, Gwang Jae Jung, Byung Chan Lim, Hyung Ju Hwang

Comments: 37 pages, 17 figures

2504.17703 2026-03-09 cs.LG cs.AI

Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence

Ratun Rahman

Comments: arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship

2504.14919 2026-03-09 cs.CV

GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection

Donghyeong Kim, Chaewon Park, Suhwan Cho, Hyeonjeong Lim, Minseok Kang, Jungho Lee, Sangyoun Lee

DOI: 10.1016/j.patcog.2026.113406
Journal ref: Pattern Recognition, 113406 (2026)

Abstract

Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.

2504.08820 2026-03-09 cs.CL

CAReDiO: Cultural Alignment via Representativeness and Distinctiveness Guided Data Optimization

Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie

2504.08818 2026-03-09 cs.LG cs.AI

From Tokenizer Bias to Backbone Capability: A Controlled Study of LLMs for Time Series Forecasting

Xinyu Zhang, Shanshan Feng, Xutao Li, Kenghong Lin, Fan Li, Pengfei Jia

2504.08603 2026-03-09 cs.RO cs.AI cs.CV

FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Helen Oleynikova, Stefan Leutenegger

Comments: 11 pages, 5 figures

2504.00837 2026-03-09 cs.SD cs.AI cs.MM

A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives

Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kejun Zhang

2503.21293 2026-03-09 cs.RO

Graph-based Online Lidar Odometry with Retrospective Map Refinement

Aaron Kurda, Simon Steuernagel, Marcus Baum

2503.15625 2026-03-09 cs.CV

EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis

Matthew Massey, Nusrat Munia, Abdullah-Al-Zubaer Imran

2503.09242 2026-03-09 cs.CV

NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Dawei Leng, Yuhui Yin

2502.18056 2026-03-09 cs.CV

Escaping The Big Data Paradigm in Self-Supervised Representation Learning

Carlos Vélez García, Miguel Cazorla, Jorge Pomares

Comments: Code and implementation available at: https://github.com/inescopresearch/scott

DOI: 10.1016/j.cviu.2026.104698
Journal ref: Computer Vision and Image Understanding, 2026, 104698

Abstract

The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside, we propose to train on a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), operating in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small-size, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve competitive results with SOTA approaches that rely on large-scale pretraining, complex image augmentations and bigger model sizes. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer applications in resource-constrained environments such as medical imaging or robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a new pathway toward more accessible and inclusive advancements in the field.

2502.17721 2026-03-09 cs.LG cs.AI cs.MA

Aligning Compound AI Systems via System-level DPO

Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo

Comments: NeurIPS 2025

2502.15805 2026-03-09 cs.LG cs.AI physics.chem-ph

FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching

Joongwon Lee, Seonghwan Kim, Seokhyun Moon, Hyunwoo Kim, Woo Youn Kim

Comments: Published in International Conference on Learning Representations (ICLR), 2026

2502.13406 2026-03-09 cs.RO cs.AI cs.SY eess.SY

Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks

Vince Kurtz, Joel W. Burdick

Comments: ICRA 2026

2502.12924 2026-03-09 cs.CL cs.AI

Conditioning LLMs to Generate Code-Switched Text

Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa

Comments: [v2] Added new experiments and analyses; [v3] Added out-of-domain evaluation; Accepted to LREC 2026

2502.05151 2026-03-09 cs.CL cs.AI cs.CV cs.LG

Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller

Comments: 46 pages, 7 figures, 7 tables

2502.04843 2026-03-09 cs.CV

PoI: A Filter to Extract Pixel of Interest from Novel Views for Scene Coordinate Regression

Feifei Li, Qi Song, Chi Zhang, Hui Shuai, Rui Huang

2501.17655 2026-03-09 cs.CV

FeatureGS: Eigenvalue-Feature Optimization in 3D Gaussian Splatting for Geometrically Accurate and Artifact-Reduced Reconstruction

Miriam Jäger, Markus Hillemann, Boris Jutzi

Comments: 16 pages, 9 figures, 7 tables

DOI: 10.1016/j.ophoto.2025.100100
Journal ref: ISPRS Open Journal of Photogrammetry and Remote Sensing Volume 17, August 2025, 100100

Abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful approach for 3D scene reconstruction using 3D Gaussians. However, neither the centers nor surfaces of the Gaussians are accurately aligned to the object surface, complicating their direct use in point cloud and mesh reconstruction. Additionally, 3DGS typically produces floater artifacts, increasing the number of Gaussians and storage requirements. To address these issues, we present FeatureGS, which incorporates an additional geometric loss term based on an eigenvalue-derived 3D shape feature into the optimization process of 3DGS. The goal is to improve geometric accuracy and enhance properties of planar surfaces with reduced structural entropy in local 3D neighborhoods. We present four alternative formulations for the geometric loss term based on 'planarity' of Gaussians, as well as 'planarity', 'omnivariance', and 'eigenentropy' of Gaussian neighborhoods. We provide quantitative and qualitative evaluations on 15 scenes of the DTU benchmark dataset focusing on the following key aspects: geometric accuracy and artifact reduction, measured by the Chamfer distance, and memory efficiency, evaluated by the total number of Gaussians. Additionally, rendering quality is monitored by Peak Signal-to-Noise Ratio. FeatureGS achieves a 30% improvement in geometric accuracy, reduces the number of Gaussians by 90%, and suppresses floater artifacts, while maintaining comparable photometric rendering quality. The geometric loss with 'planarity' from Gaussians provides the highest geometric accuracy, while 'omnivariance' in Gaussian neighborhoods reduces floater artifacts and number of Gaussians the most. This makes FeatureGS a strong method for geometrically accurate, artifact-reduced and memory-efficient 3D scene reconstruction, enabling the direct use of Gaussian centers for geometric representation.

2501.15188 2026-03-09 cs.CL cs.SI physics.soc-ph

Who is the root in a syntactic dependency structure?

Ramon Ferrer-i-Cancho, Marta Arias

Comments: Background and discussion improved. Clarity and consistency enhanced. Language improved. Typos corrected

2501.11268 2026-03-09 cs.LG stat.ML

L0-Regularized Quadratic Surface Support Vector Machines

Ahmad Mousavi, Ramin Zandvakili, Zheming Gao

2501.06986 2026-03-09 cs.CV cs.CL

Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

Comments: Accepted by TMLR

2412.07380 2026-03-09 cs.CL cs.AI

SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

Bo Lv, Nayu Liu, Chen Tang, Xin Liu, Yue Yu, Ping Luo

Comments: 15 pages, 5 figures