arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2075
专题追踪 全部专题
2511.09771 2026-05-14 cs.CV

STORM: Segment, Track, and Object Re-Localization from a Single Image

Yu Deng, Teng Cao, Hikaru Shindo, Quentin Delfosse, Jiahong Xue, Kristian Kersting

发表机构 * Department of Computer Science, Technical University of Darmstadt, Darmstadt, Hesse, Germany(德累斯顿技术大学计算机科学系) Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Hesse, Germany(黑森人工智能中心(hessian.AI)) German Research Center for Artificial Intelligence (DFKI), Darmstadt, Hesse, Germany(德国人工智能研究中心(DFKI)) Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Hesse, Germany(德累斯顿技术大学认知科学中心) Google Intrinsic AI Research, Germany. † Work done while at the AIML research lab, now working at Intrinsic, Google.(谷歌Intrinsic AI研究)

AI总结 STORM 是一种统一的框架,能够基于单张参考图像进行条件化的6D姿态估计与跟踪,具有较高的鲁棒性和较低的人工输入需求。该方法结合了分层空间融合注意力机制和基于BCE训练的跟踪验证器,能够在遮挡和快速运动等复杂场景下稳定恢复目标姿态。实验表明,STORM 在无需标注的情况下优于现有方法,并能有效应对严重遮挡和视角变化。

Comments 21 pages. Accepted at the 43rd International Conference on Machine Learning (ICML 2026); camera-ready version

详情
英文摘要

Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.

2510.13385 2026-05-14 cs.LG

Probabilistic Prediction Markets with Intermittent Contributions

Michael Vitali, Pierre Pinson

发表机构 * Dyson School of Design Engineering, Imperial College London(帝国理工学院伦敦校区设计工程学院) Halfspace, Denmark(丹麦Halfspace公司) Department of Technology, Management and Economics, Technical University of Denmark(丹麦技术大学技术、管理与经济学系) CoRE, Aarhus University(阿arhus大学CoRE)

AI总结 本文研究了在数据所有权和竞争利益限制下,如何通过预测市场机制促进多方协作进行准确预测的问题。提出了一种允许代理自主进出市场、适应动态环境并考虑历史表现的预测市场框架,采用鲁棒回归模型处理缺失提交,并设计了一种兼顾样本内与样本外性能的收益分配机制。实验表明,该设计在模拟和真实数据中均表现出良好的有效性和适应性。

详情
英文摘要

Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework, based on prediction markets. There, independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, while (iii) permitting agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal forecasts' combination whilst handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case-studies using simulated and real-world data allow demonstrating the effectiveness and adaptability of the proposed market design.

2509.22123 2026-05-14 cs.CL

Multilingual Vision-Language Models, A Survey

Andrei-Alexandru Manea, Jindřich Libovický

发表机构 * Faculty of Mathematics and Physics, Charles University, V Holešovičkách 747/2, Prague, Czech Republic(数学与物理系,查尔斯大学,V Holešovičkách 747/2,布拉格,捷克共和国)

AI总结 本文综述了能够处理多语言文本与图像的多语言视觉-语言模型,系统回顾了33个模型和23个基准测试,分析了编码器和生成式架构的发展趋势,并指出了语言中立性与文化适应性之间的关键矛盾。当前训练方法倾向于通过对比学习实现语言中立性,而文化适应性则依赖于多样化数据,多数评估基准优先考虑语义一致性,但近期研究开始引入文化相关的内容以弥补这一差距。

详情
英文摘要

This survey examines multilingual vision-language models that process text and images across languages. We review 33 models and 23 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.

2509.21543 2026-05-14 cs.RO

Self-CriTeach: LLM Self-Teaching and Self-Critiquing for Improving Robotic Planning via Automated Domain Generation

Jinbang Huang, Zhiyuan Li, Yuanzhao Hu, Zhanguang Zhang, Mark Coates, Xingyue Quan, Yingxue Zhang

发表机构 * Huawei Noah's Ark Lab(华为诺亚实验室) University of Toronto(多伦多大学) University of British Columbia(不列颠哥伦比亚大学) McGill University(麦吉尔大学)

AI总结 该研究提出了一种名为 Self-CriTeach 的框架,旨在通过大语言模型(LLM)的自我教学与自我批评机制,提升机器人规划能力。该方法利用 LLM 自主生成符号规划域,既用于生成大规模的机器人任务-计划对以进行监督微调,又作为结构化奖励函数提供密集反馈以增强强化学习。该统一训练流程显著提高了 LLM 的规划成功率、跨任务泛化能力,并降低了推理成本和对不完美逻辑状态的敏感性。

Comments International Conference on Machine Learning (ICML) 2026

详情
英文摘要

Large Language Models (LLMs) have shown strong promise for robotic task planning, particularly through the automatic generation of symbolic planning domains. However, prior work mainly treats generated domains as planning utilities. Such pipelines remain brittle under imperfect logical states and perception noise, while overlooking the potential of generated domains as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision, which is expensive to collect for robotic tasks, and reinforcement learning (RL) faces challenges in reward engineering. We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (1) In the self-teaching stage, generated domains are used to produce large-scale robotic planning problem--plan pairs, which are automatically converted into extended CoT trajectories for supervised fine-tuning. (2) In the self-critiquing stage, the same domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and improved resistance to imperfect logical states. GitHub Page: https://markli1hoshipu.github.io/Plan_LLM/

2509.20786 2026-05-14 cs.LG

LiLAW: Lightweight Learnable Adaptive Weighting to Learn Sample Difficulty & Improve Noisy Training

Abhishek Moturu, Muhammad Muzammil, Anna Goldenberg, Babak Taati

发表机构 * Department of Computer Science(计算机科学系) University of Toronto(多伦多大学) Department of Mathematics(数学系) The Hospital for Sick Children(圣·玛利亚医院) Department of Statistics(统计学系) UHN KITE Research Institute(UHN KITE研究所) T-CAIREM Vector Institute(向量研究所) Institute of Biomedical Engineering(生物医学工程研究所) Rehabilitation Sciences Institute(康复科学研究所)

AI总结 本文提出了一种轻量可学习的自适应加权方法LiLAW,用于在存在噪声和数据异质性的场景下提升深度神经网络的训练效果。该方法通过三个全局可学习的标量参数动态调整每个样本的损失权重,根据样本难度(易、中、难)进行自适应调整,并在每次训练小批量后使用验证小批量进行一次梯度下降更新,无需干净的验证集。实验表明,LiLAW在多种数据集和噪声条件下均能有效提升模型准确率和AUROC,尤其在高噪声环境下表现突出,且计算高效,适用于资源受限的场景。

详情
英文摘要

Training deep neural networks with noise and data heterogeneity is a major challenge. We introduce Lightweight Learnable Adaptive Weighting (LiLAW), a method that dynamically adjusts the loss weight of each training sample based on its evolving difficulty, categorized as easy, moderate, and hard, using only three global learnable scalar parameters. LiLAW learns to adaptively prioritize samples by updating these parameters with a single gradient descent step on a validation mini-batch after each training mini-batch, without requiring a clean, unbiased validation set. Experiments across general and medical imaging datasets, several noise types and levels, loss functions, and architectures with and without pretraining, including linear probing and full fine-tuning, show that LiLAW consistently improves accuracy and AUROC, especially in higher-noise settings, without requiring excessive tuning. We also obtain state-of-the-art results incorporating synthetic and augmented data from SynPAIN, GAITGen, ECG5000, and improved fairness on the Adult dataset. LiLAW is lightweight, practical, and computationally efficient, making it an effective, scalable approach to boost generalization and robustness across diverse deep learning training setups, especially in resource-constrained settings.

2509.18993 2026-05-14 cs.LG

CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Boao Kong, Junzhu Liang, Yuxi Liu, Renjia Deng, Kun Yuan

发表机构 * Peking University(北京大学)

AI总结 本文提出了一种名为CR-Net的参数高效的预训练框架,旨在解决当前低秩结构方法在模型性能、计算开销和激活内存节省方面的不足。CR-Net基于跨层激活残差具有低秩特性的发现,采用双路径架构,通过结合前一层输出与低秩差异高效重建层激活,从而在保持高秩信息的同时大幅减少参数量。实验表明,CR-Net在不同规模的模型(从60M到7B参数)上均优于现有低秩方法,且在计算资源和内存消耗方面表现更优。

Comments 32 pages. Accepted by ICLR 2026

详情
英文摘要

Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.

2509.13316 2026-05-14 cs.CL cs.LG

Do Activation Verbalization Methods Convey Privileged Information?

Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace

发表机构 * Northeastern University(东北大学) Kempner Institute, Harvard University(哈佛大学凯姆纳研究所) Boston University(波士顿大学)

AI总结 本文探讨了激活语言化方法是否能揭示大型语言模型(LLM)的内部工作机制。研究发现,现有方法可能更多地反映语言化模型自身的参数知识,而非目标模型的内部状态。实验表明,这些方法在无需访问目标模型内部信息的情况下也能表现良好,说明当前数据集不足以有效评估语言化方法的效果,亟需设计更严格的基准和实验控制来验证其真正的解释能力。

Comments ICML 2026. 41 pages, 23 tables, 6 figures

详情
英文摘要

Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about the inputs provided to it? We critically evaluate popular verbalization methods and datasets used in prior work and find that one can perform well on such benchmarks without access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM that generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

2508.09479 2026-05-14 cs.CV

SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, Yongjun Zhang

发表机构 * School of Remote Sensing and Information Engineering, Wuhan University(武汉大学遥感与信息工程学院) Technology Innovation Center for Collaborative Applications of Natural Resources Data in GBA, Ministry of Natural Resources(粤港澳大湾区自然资源数据协同应用技术创新中心,自然资源部) Department of Geography and Resource Management, The Chinese University of Hong Kong(香港中文大学地理与资源管理系) China Railway Siyuan Survey and Design Group Co., LTD(中国铁路syuan调查设计集团有限公司)

AI总结 本文提出了一种名为SkySplat的新型自监督框架,旨在从多时相稀疏卫星图像中实现通用化的三维高斯点云重建。该方法通过将有理多项式系数(RPC)模型集成到通用3D高斯点云生成流程中,解决了现有方法在卫星图像处理中几何约束不足、瞬时物体干扰和辐射不一致等问题。SkySplat仅依赖RGB图像和鲁棒的相对高度监督,无需真实高度图即可实现高效且准确的重建,并在多个基准数据集上表现出优越的性能和跨数据集泛化能力。

Comments AAAI 2026. Code is available at https://github.com/NanCheng2001/SkySplat-main

详情
英文摘要

Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for its high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86 times speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset significantly, and demonstrates strong cross-dataset generalization on the MVS3D benchmark. The is available at https://github.com/NanCheng2001/SkySplat-main

2508.09320 2026-05-14 cs.LG cs.AI cs.CR

Exact Verification of Graph Neural Networks with Incremental Constraint Solving

Minghao Liu, Chia-Hsuan Lu, Marta Kwiatkowska

发表机构 * University of Oxford(牛津大学)

AI总结 该论文提出了一种用于图神经网络(GNN)的精确验证方法,旨在应对属性和结构扰动下的对抗攻击,确保模型的鲁棒性。该方法通过约束求解与边界收紧相结合,并利用求解器的增量求解能力提升效率,支持包括求和、最大值和平均值在内的三种聚合函数,其中后两种为首次应用。实验表明,该方法在多个真实数据集上表现出良好的实用性和优越的分类性能。

Comments Extended version of the paper accepted at FM 2026

详情
英文摘要

Graph neural networks (GNNs) are increasingly often employed in high-stakes applications, such as fraud detection or healthcare, but are susceptible to adversarial attacks. A number of techniques have been proposed to provide adversarial robustness guarantees, but support for commonly used aggregation functions in message-passing GNNs is lacking. In this paper, we develop an exact (sound and complete) verification method for GNNs to compute guarantees against attribute and structural perturbations that involve edge addition or deletion, subject to budget constraints. Our method employs constraint solving with bound tightening, and iteratively solves a sequence of relaxed constraint satisfaction problems while relying on incremental solving capabilities of solvers to improve efficiency. We implement GNNev, a versatile exact verifier for message-passing neural networks, which supports three aggregation functions -- sum, max and mean -- with the latter two considered here for the first time. Extensive experimental evaluation of GNNev on real-world fraud datasets (Amazon and Yelp) and biochemical datasets (MUTAG and ENZYMES) demonstrates its usability and effectiveness, as well as superior performance on node classification and competitiveness on graph classification compared to existing exact verification tools on sum-aggregated GNNs.

2507.12720 2026-05-14 cs.CL

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar

发表机构 * The Ohio State University(俄亥俄州立大学) University of Washington(华盛顿大学)

AI总结 本文研究了语言模型在面对新数据分布时的适应性问题,指出传统子词分词器的固定性导致在分布外领域、未见过的语言或脚本中出现文本过度碎片化的问题。为此,作者提出了一种可学习的字节级分词器,通过预测输入字节序列的边界来实现自适应分词,并设计了FLEXITOKENS这一简化训练目标,显著提升了分词的灵活性。实验表明,该方法在多种多语言基准和生成任务中有效减少了分词过度碎片化,相比BPE等传统分词方法在分类和生成任务上提升了约10个百分点。

Comments Accepted to ACL (findings) 2026

详情
英文摘要

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of text in out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% point improvements on token classification and generative tasks compared to BPE and other gradient-based tokenizer baselines. We validate our findings using models of varying sizes, and our method demonstrates consistent improvements across scales. Code and data for our experiments will be released at https://github.com/skai-research/flexitokens

2507.09205 2026-05-14 cs.CL

From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan

Lei Yang, Leiyu Pan, Bojian Xiong, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong

发表机构 * TJUNLP Lab(TJUNLP实验室) School of Computer Science and Technology(计算机科学与技术学院) Tianjin University(天津大学)

AI总结 该研究针对藏语这类低资源语言的大规模语言模型发展不足的问题,提出了一套完整的解决方案,包括构建72GB的高质量藏语语料库,并通过多语言持续预训练和指令调优对Qwen2.5-7B模型进行适配。为进一步提升模型容量,研究还将其扩展为50B-10B的专家混合架构,并构建了多个高质量评估数据集。实验表明,所提出的密集模型和MoE模型在多种任务上均优于现有同规模模型,为藏语及其它低资源语言的大模型研究提供了重要参考。

详情
英文摘要

Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.

2507.07316 2026-05-14 cs.LG cs.CR

AdeptHEQ-FL: Adaptive Homomorphic Encryption for Federated Learning of Hybrid Classical-Quantum Models with Dynamic Layer Sparing

Md Abrar Jahin, Taufikur Rahman Fuad, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen

发表机构 * University of Southern California(南加州大学) Islamic University of Technology(伊斯兰科技大学) American International University-Bangladesh(孟加拉国美国国际大学) Multimedia University(多媒体大学)

AI总结 该研究提出了一种名为AdeptHEQ-FL的统一混合经典-量子联邦学习框架,旨在解决非独立同分布环境下模型性能、隐私保护与通信效率之间的平衡问题。该方法结合了混合CNN-PQC架构、基于差分隐私的精度加权聚合策略、选择性同态加密技术以及动态层级自适应冻结机制,实现了对敏感模型层的安全聚合与通信开销的最小化。实验表明,该方法在CIFAR-10等数据集上相比现有方法具有显著的精度提升和通信效率优势,验证了其在隐私保护与资源优化方面的有效性。

Comments Accepted in 1st International Workshop on ICCV'25 BISCUIT (Biomedical Image and Signal Computing for Unbiasedness, Interpretability, and Trustworthiness)

Journal ref 1st International Workshop on BISCUIT at ICCV 2025

详情
英文摘要

Federated Learning (FL) faces inherent challenges in balancing model performance, privacy preservation, and communication efficiency, especially in non-IID decentralized environments. Recent approaches either sacrifice formal privacy guarantees, incur high overheads, or overlook quantum-enhanced expressivity. We introduce AdeptHEQ-FL, a unified hybrid classical-quantum FL framework that integrates (i) a hybrid CNN-PQC architecture for expressive decentralized learning, (ii) an adaptive accuracy-weighted aggregation scheme leveraging differentially private validation accuracies, (iii) selective homomorphic encryption (HE) for secure aggregation of sensitive model layers, and (iv) dynamic layer-wise adaptive freezing to minimize communication overhead while preserving quantum adaptability. We establish formal privacy guarantees, provide convergence analysis, and conduct extensive experiments on the CIFAR-10, SVHN, and Fashion-MNIST datasets. AdeptHEQ-FL achieves a $\approx 25.43\%$ and $\approx 14.17\%$ accuracy improvement over Standard-FedQNN and FHE-FedQNN, respectively, on the CIFAR-10 dataset. Additionally, it reduces communication overhead by freezing less important layers, demonstrating the efficiency and practicality of our privacy-preserving, resource-aware design for FL.

2505.21238 2026-05-14 cs.CV

3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling

Jieyu Yuan, Yujun Li, Yuanlin Zhang, Chunle Guo, Xiongxin Tang, Ruixing Wang, Chongyi Li

发表机构 * VCIP, College of Computer Science, Nankai University(VCIP,计算机科学学院,南开大学) Institute of Software, Chinese Academy of Sciences(软件研究所,中国科学院) DJI(大疆创新)

AI总结 该论文提出了一种基于物理原理的3D高斯点云方法(3D-UIR),用于解决水下三维场景重建中的光-介质耦合问题。通过将物体外观与水介质效应解耦,并引入显式的介质嵌入表示,有效提升了场景的一致性和渲染质量。此外,该方法结合深度引导的优化策略,提高了几何重建的准确性,在水下场景的视图合成和场景恢复方面取得了显著改进。

Comments Accepted to IEEE TIP 2026. Project webpage: https://bilityniu.github.io/3D-UIR

详情
英文摘要

Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in water body bring inhomogeneous medium attenuation interference that disrupts conventional volume rendering assumptions of uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with underwater inhomogeneous environments where scattering media introduces artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a depth-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate our significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at https://bilityniu.github.io/3D-UIR.

2505.15616 2026-05-14 cs.CV

LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool

发表机构 * Wuhan University of Technology(武汉理工大学) Tsinghua University(清华大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Shanghai AI Lab(上海人工智能实验室) University of Oxford(牛津大学) INSAIT, Sofia Un. St Kliment Ohridski(索菲亚大学克里门特·欧里迪斯基学院)

AI总结 该研究提出了LENS,一个多层级的基准测试,用于评估多模态大语言模型在感知、理解和推理任务中的综合能力。LENS包含3400张当代图像和6万余个由人类撰写的问答,覆盖八个任务和十二种日常场景,支持从基础感知到复杂推理的多层次评估。该数据集通过丰富的标注和来自社交媒体的高质量图像,能够更真实地反映模型在现实场景中的表现,实验表明当前前沿模型在推理任务上的准确率均未超过60%。

Comments Published as a conference paper at ICLR 2026

详情
英文摘要

Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in the task-oriented manner without guarantee that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manully collected from the social media, in which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models are released later than Dec. 2024, and none of them achieve an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/

2505.09760 2026-05-14 cs.RO cs.NE

Neural Associative Skill Memories for safer robotics and modelling human sensorimotor repertoires

Pranav Mahajan, Mufeng Tang, T. Ed Li, Ioannis Havoutis, Ben Seymour

发表机构 * University of Oxford(牛津大学) Yale University(耶鲁大学)

AI总结 本文提出了一种名为神经关联技能记忆(Neural Associative Skill Memories)的框架,旨在提升机器人在复杂环境中的安全性和适应性。该方法通过自监督预测编码实现技能学习与表达的统一,无需显式选择技能即可根据上下文进行技能识别与执行,并具备故障检测能力。相比传统方法,该模型采用局部学习规则,实现了与生物运动准备相关的速度-精度权衡,为神经机器人学和人类感觉运动学习提供了新的计算视角。

Journal ref Neural Computation (2026) 38 (1): 1-27

详情
英文摘要

Modern robots face challenges shared by humans, where machines must learn multiple sensorimotor skills and express them adaptively. Equipping robots with a human-like memory of how it feels to do multiple stereotypical movements can make robots more aware of normal operational states and help develop self-preserving safer robots. Associative Skill Memories (ASMs) aim to address this by linking movement primitives to sensory feedback, but existing implementations rely on hard-coded libraries of individual skills. A key unresolved problem is how a single neural network can learn a repertoire of skills while enabling fault detection and context-aware execution. Here we introduce Neural Associative Skill Memories (ASMs), a framework that utilises self-supervised predictive coding for temporal prediction to unify skill learning and expression, using biologically plausible learning rules. Unlike traditional ASMs which require explicit skill selection, Neural ASMs implicitly recognize and express skills through contextual inference, enabling fault detection across learned behaviours without an explicit skill selection mechanism. Compared to recurrent neural networks trained via backpropagation through time, our model achieves comparable qualitative performance in skill memory expression while using local learning rules and predicts a biologically relevant speed-accuracy trade-off during skill memory expression. This work advances the field of neurorobotics by demonstrating how predictive coding principles can model adaptive robot control and human motor preparation. By unifying fault detection, reactive control, skill memorisation and expression into a single energy-based architecture, Neural ASMs contribute to safer robotics and provide a computational lens to study biological sensorimotor learning.

2502.18917 2026-05-14 cs.AI cs.PL cs.SE

ClassInvGen: Class Invariant Synthesis using Large Language Models

Chuyue Sun, Viraj Agashe, Saikat Chakraborty, Jubi Taneja, Clark Barrett, David Dill, Xiaokang Qiu, Shuvendu K. Lahiri

发表机构 * Stanford University(斯坦福大学) Microsoft Research(微软研究院) Purdue University(普渡大学)

AI总结 ClassInvGen 是一种利用大语言模型(LLM)生成类不变式的方法,旨在为如 C++ 等主流编程语言生成高质量的类不变式。该方法通过协同生成可执行的类不变式和测试输入,提升了不变式的准确性和完整性,并在实验中优于基于纯 LLM 和传统数据驱动的方法。研究还构建了一个包含标准 C++ 数据结构的基准测试集,并通过实际案例验证了其在真实代码库中的应用效果。

详情
英文摘要

Formal program specifications in the form of preconditions, postconditions, and class invariants have several benefits for the construction and maintenance of programs. They not only aid in program understanding due to their unambiguous semantics but can also be enforced dynamically (or even statically when the language supports a formal verifier). However, synthesizing high-quality specifications in an underlying programming language is limited by the expressivity of the specifications or the need to express them in a declarative manner. Prior work has demonstrated the potential of large language models (LLMs) for synthesizing high-quality method pre/postconditions for Python and Java, but does not consider class invariants. In this work, we describe ClassInvGen, a method for co-generating executable class invariants and test inputs to produce high-quality class invariants for a mainstream language such as C++, leveraging LLMs' ability to synthesize pure functions. We show that ClassInvGen outperforms a pure LLM-based technique to generate specifications (from code) as well as prior data-driven invariant inference techniques such as Daikon. We contribute a benchmark of standard C++ data structures along with a harness that can help measure both the correctness and completeness of generated specifications using tests and mutants. We also demonstrate its applicability to real-world code by performing a case study on several classes within a widely used and high-integrity C++ codebase.

2502.05157 2026-05-14 cs.LG cs.DS

Efficient distributional regression trees learning algorithms for calibrated non-parametric probabilistic forecasts

Quentin Duchemin, Guillaume Obozinski

发表机构 * Swiss Data Science Center(瑞士数据科学中心) EPFL(瑞士联邦理工学院) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文提出了一种高效的概率回归树学习算法,用于在加权区间分数(WIS)或连续排名概率分数(CRPS)损失函数下进行校准的非参数概率预测。通过引入最小最大堆、权重平衡二叉树和Fenwick树等数据结构,算法在计算效率上得到了显著提升。该方法不仅在数值实验中表现出与现有方法相当的性能,还继承了树模型的可解释性,适用于符合预测和组条件覆盖率保证的场景。

详情
英文摘要

The perspective of developing trustworthy AI for critical applications in science and engineering requires machine learning techniques that are capable of estimating their own uncertainty. In the context of regression, instead of estimating a conditional mean, this can be achieved by producing a predictive interval for the output, or to even learn a model of the conditional probability $p(y|x)$ of an output $y$ given input features $x$. While this can be done under parametric assumptions with, e.g. generalized linear model, these are typically too strong, and non-parametric models offer flexible alternatives. In particular, for scalar outputs, learning directly a model of the conditional cumulative distribution function of $y$ given $x$ can lead to more precise probabilistic estimates, and the use of proper scoring rules such as the weighted interval score (WIS) and the continuous ranked probability score (CRPS) lead to better coverage and calibration properties. This paper introduces novel algorithms for learning probabilistic regression trees for the WIS or CRPS loss functions. These algorithms are made computationally efficient thanks to an appropriate use of known data structures - namely min-max heaps, weight-balanced binary trees and Fenwick trees. Through numerical experiments, we demonstrate that the performance of our methods is competitive with alternative approaches. Additionally, our methods benefit from the inherent interpretability and explainability of trees. As a by-product, we show how our trees can be used in the context of conformal prediction and explain why they are particularly well-suited for achieving group-conditional coverage guarantees.

2501.10598 2026-05-14 cs.LG

Addressing Finite-Horizon MDPs via Low-Rank Tensor Value Approximation

Sergio Rozada, Jose Luis Orejuela, Antonio G. Marques

发表机构 * Department of Signal Theory and Comms.(信号理论与通讯系) King Juan Carlos University(国王胡安·卡洛斯大学)

AI总结 本文研究了在有限时间范围的马尔可夫决策过程(MDPs)中,利用低秩张量近似值函数的方法学习最优策略的问题。针对有限时间MDPs中值函数非平稳带来的高维问题和样本复杂度高的挑战,作者提出将值函数建模为低秩张量,从而实现可扩展的表示形式,并在策略迭代框架下结合低秩策略评估与贪心策略改进,计算近似最优策略。该方法引入了基于优化的贝尔曼方程求解框架及块坐标下降算法,并在未知系统动态情况下通过采样轨迹估计值函数,实验表明该方法在计算效率和策略性能方面均具有优势。

详情
英文摘要

We study the problem of learning optimal policies in finite-horizon Markov Decision Processes (MDPs) using low-rank reinforcement learning (RL) methods. In finite-horizon MDPs, the policies, and therefore the value functions (VFs) are not stationary. This aggravates the challenges of high-dimensional MDPs, as they suffer from the curse of dimensionality and high sample complexity. To address these issues, we propose modeling the VFs of finite-horizon MDPs as low-rank tensors, enabling a scalable representation that renders the problem of learning optimal policies tractable. Our approach focuses on VF approximation within a policy iteration framework, where low-rank policy evaluation is combined with greedy policy improvement to compute near-optimal policies. We introduce an optimization-based framework for solving the Bellman equations with low-rank constraints, along with block-coordinate descent (BCD) and block-coordinate gradient descent (BCGD) algorithms, both with theoretical convergence guarantees. We further establish that bounded low-rank policy evaluation error translates into bounded policy improvement in the finite-horizon setting. For scenarios where the system dynamics are unknown, we adapt the proposed BCGD method to estimate the VFs using sampled trajectories. Numerical experiments further demonstrate that the proposed framework reduces computational demands in controlled synthetic scenarios and more realistic resource allocation problems, while achieving competitive policy performance in terms of attained returns.

2501.05982 2026-05-14 cs.LG eess.SP

Deep Variational Sequential Monte Carlo for High-Dimensional Observations

Wessel L. van Nierop, Nir Shlezinger, Ruud J. G. van Sloun

发表机构 * Dept. of Electrical Engineering(电气工程系) Eindhoven University of Technology(埃因霍温理工大学) Dept. of Electrical and Computer Engineering(电气与计算机工程系) Ben-Gurion University of the Negev(贝内-杰尔大学)

AI总结 本文提出了一种基于深度变分思想的序列蒙特卡洛方法,用于处理高维观测下的非线性状态空间系统。该方法通过神经网络参数化提议分布和状态转移分布,利用无监督变分SMC目标进行学习,从而提升粒子滤波的性能。实验表明,该方法在高维部分观测下对洛伦兹吸引子的跟踪任务中优于现有基准,并且在证据下界评估中显示出对后验分布更准确的建模能力。

Journal ref ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025

详情
英文摘要

Sequential Monte Carlo (SMC), or particle filtering, is widely used in nonlinear state-space systems, but its performance often suffers from poorly approximated proposal and state-transition distributions. This work introduces a differentiable particle filter that leverages the unsupervised variational SMC objective to parameterize the proposal and transition distributions with a neural network, designed to learn from high-dimensional observations. Experimental results demonstrate that our approach outperforms established baselines in tracking the challenging Lorenz attractor from high-dimensional and partial observations. Furthermore, an evidence lower bound based evaluation indicates that our method offers a more accurate representation of the posterior distribution.

2410.22643 2026-05-14 cs.RO

An Overtaking Trajectory Planning Framework Based on Spatio-temporal Topology and Reachable Set Analysis Ensuring Time Efficiency

Wule Mao, Zhouheng Li, Entao Sun, Lei Xie, Hongye Su

发表机构 * State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, 310027, China(工业控制技术国家重点实验室,浙江大学,杭州,310027,中国) Shanghai STEP Electric Corporation, Shanghai, 201802, China(上海STEP电力有限公司,上海,201802,中国)

AI总结 本文提出了一种基于时空拓扑和可达集分析的超车轨迹规划框架(SROP),旨在解决高速场景下传统分层规划方法易陷入局部最优和计算效率低的问题。该框架通过引入拓扑类别表示不同的超车行为,上层规划器进行时空搜索以生成多样化的初始路径,下层规划器利用可达集并行评估轨迹,从而解耦车辆运动学约束并加速计算。实验表明,SROP在轨迹平滑性和计算效率方面均有显著提升,并在F1TENTH仿真平台中验证了其在复杂场景下的实用性和鲁棒性。

详情
英文摘要

Generating overtaking trajectories in high-speed scenarios is typically addressed through hierarchical planning, which often suffers from local optima due to single initial solutions and low computational efficiency during numerical optimization. To overcome these limitations, this paper proposes a Spatio-temporal topology and Reachable set analysis enhanced Overtaking trajectory Planning framework (SROP). Specifically, by introducing topological classes to represent distinct overtaking behaviors, the upper-layer planner performs a spatio-temporal search to extract diverse initial paths, effectively preventing local optima. Subsequently, a lower-layer planner conducts parallel trajectory evaluation using reachable sets, which decouples vehicle kinematic constraints from the optimization process to ensure feasibility and significantly accelerate computation. Numerical experiments demonstrate that SROP improves trajectory smoothness by 66.8% and reduces computation time by 62.9% compared to state-of-the-art methods. Furthermore, by seamlessly integrating the method into the F1TENTH autonomous racing simulation platform, a 100-lap sensitivity analysis demonstrates high overtaking success rates in challenging scenarios, thereby validating its practical utility, real-time efficiency, and robustness.

2409.02708 2026-05-14 cs.LG stat.ME

Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit

Chaozhi Zhang, Lin Liu, Xiaoqun Zhang

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University(上海交通大学自然科学研究院) SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University(上海交通大学-耶鲁大学生物统计与数据科学联合中心)

AI总结 本文研究了在数据稀缺情况下如何通过多任务学习提取线性不变特征的问题,提出了一种名为Meta Subspace Pursuit(Meta-SP)的新算法,用于学习不同任务间共享的低秩不变子空间。该方法在算法层面和统计层面均提供了理论保证,并通过大量实验验证了其在性能上的优越性,优于包括ANIL在内的多种对比方法。

Journal ref CSIAM Transactions on Applied Mathematics (2026)

详情
英文摘要

Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.

2409.02038 2026-05-14 cs.CL cs.AI cs.DB

BEAVER: An Enterprise Benchmark for Text-to-SQL

Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker

发表机构 * MIT(麻省理工学院) Harvard University(哈佛大学) Greenshoe, Inc.(Greenshoe公司)

AI总结 BEAVER 是首个基于私有数据仓库构建的文本到 SQL 基准测试集,旨在评估大语言模型在复杂企业环境中的表现。该基准包含来自真实查询日志的 9128 个问题-SQL 对,覆盖 19 个不同领域,涵盖复杂的数据库结构和专业领域知识。为解决企业数据稀缺和评估指标不足的问题,BEAVER 通过合成高质量专家验证查询,并引入细粒度子任务评估指标,揭示了当前先进模型在实际企业场景中的显著性能差距。

Comments Dataset and code are available at https://beaverbench.github.io/

详情
英文摘要

Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.

2110.00062 2026-05-14 cs.RO cs.SY eess.SY

Simulation-based multi-criteria comparison of mono-articular and bi-articular exoskeletons during walking with and without load

Ali KhalilianMotamed Bonab, Volkan Patoglu

发表机构 * Faculty of Engineering and Natural Sciences(工程与自然科学学院)

AI总结 本文通过仿真方法对单关节和双关节外骨骼在不同负载条件下的行走性能进行了多目标比较,研究了外骨骼动力学特性与辅助扭矩对代谢成本、肌肉激活和关节反作用力的影响。作者提出了一种基于帕累托优化的多目标设计方法,同时优化外骨骼的功耗和人体代谢率降低效果,并考虑了设备惯性和电能再生的影响。研究结果表明,尽管两种外骨骼的辅助水平相近,但单关节外骨骼在降低关节峰值反作用力方面表现更优,而双关节外骨骼的功耗对负载变化的敏感性更低,且其惯性对代谢成本的负面影响较小。

详情
英文摘要

Developing exoskeletons that can reduce the metabolic cost of assisted subjects is challenging since a systematic design approach is required to capture the effects of device dynamics and the assistance torques on human performance. Design studies that rely on musculoskeletal models hold high promise in providing effective design guidelines, as the effect of various devices and different assistance torque profiles on metabolic cost can be studied systematically. In this paper, we present a simulation-based multi-criteria design approach to systematically study the effect of different device kinematics and corresponding optimal assistive torque profiles under actuator saturation on the metabolic cost, muscle activation, and joint reaction forces of subjects walking under different loading conditions. For the multi-criteria comparison of exoskeletons, we introduce a Pareto optimization approach to simultaneously optimize the exoskeleton power consumption and the human metabolic rate reduction during walking, under different loading conditions. We further superpose the effects of device inertia and electrical regeneration on the metabolic rate and power consumption, respectively. Our results explain the effects of heavy loads on the optimal assistance profiles of the exoskeletons and provide guidelines on choosing optimal device configurations under actuator torque limitations, device inertia, and regeneration effects. The multi-criteria comparison of devices indicates that despite the similar assistance levels of both devices, mono-articular exoskeletons show better performance on reducing the peak reaction forces, while the power consumption of bi-articular devices is less sensitive to the loading. Furthermore, for the bi-articular exoskeletons, the device inertia has lower detrimental effects on the metabolic cost of subjects and does not affect the Pareto-optimality of solutions.

2008.03496 2026-05-14 cs.AI cs.LO cs.RO

Human Robot Collaborative Assembly Planning: An Answer Set Programming Approach

Momina Rizwan, Volkan Patoglu, Esra Erdem

发表机构 * Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey(工程与自然科学学院,萨班奇大学,伊斯坦布尔,土耳其)

AI总结 本文研究了人机协作装配任务中的规划问题,提出了一种基于答案集编程的方法,结合常识推理和丰富的通信动作,以应对人类行为不确定性带来的挑战。该方法通过扩展混合条件规划,实现了对装配动作顺序的高层规划与几何可行性验证,并在实际场景中验证了其有效性,展示了双臂机器人与人类协作组装家具的应用案例。

Comments 36th International Conference on Logic Programming (ICLP 2020), University Of Calabria, Rende (CS), Italy, September 2020, 15 pages

详情
英文摘要

For planning an assembly of a product from a given set of parts, robots necessitate certain cognitive skills: high-level planning is needed to decide the order of actuation actions, while geometric reasoning is needed to check the feasibility of these actions. For collaborative assembly tasks with humans, robots require further cognitive capabilities, such as commonsense reasoning, sensing, and communication skills, not only to cope with the uncertainty caused by incomplete knowledge about the humans' behaviors but also to ensure safer collaborations. We propose a novel method for collaborative assembly planning under uncertainty, that utilizes hybrid conditional planning extended with commonsense reasoning and a rich set of communication actions for collaborative tasks. Our method is based on answer set programming. We show the applicability of our approach in a real-world assembly domain, where a bi-manual Baxter robot collaborates with a human teammate to assemble furniture. This manuscript is under consideration for acceptance in TPLP.

1811.12784 2026-05-14 cs.CV

The GAN that Warped: Semantic Attribute Editing with Unpaired Data

Gara Dorta, Sara Vicente, Neill D. F. Campbell, Ivor J. A. Simpson

发表机构 * University of Bath(巴斯大学) Anthropics Technology Ltd.(Anthropics技术有限公司) University of Sussex(苏塞克斯大学)

AI总结 该研究提出了一种基于平滑变形场的语义图像编辑方法,能够在不依赖配对数据的情况下实现高质量的图像编辑。通过结合生成对抗网络(GAN)的最新进展,该方法能够使用未配对数据进行训练,有效保留图像主体的身份特征,并在高分辨率(如4K)图像上实现了高效的编辑。实验表明,该方法在人脸和鸟类图像数据集上均表现出优异的编辑效果和鲁棒性。

Comments CVPR 2020

详情
英文摘要

Deep neural networks have recently been used to edit images with great success, in particular for faces. However, they are often limited to only being able to work at a restricted range of resolutions. Many methods are so flexible that face edits can often result in an unwanted loss of identity. This work proposes to learn how to perform semantic image edits through the application of smooth warp fields. Previous approaches that attempted to use warping for semantic edits required paired data, i.e. example images of the same subject with different semantic attributes. In contrast, we employ recent advances in Generative Adversarial Networks that allow our model to be trained with unpaired data. We demonstrate face editing at very high resolutions (4k images) with a single forward pass of a deep network at a lower resolution. We also show that our edits are substantially better at preserving the subject's identity. The robustness of our approach is demonstrated by showing plausible image editing results on the Cub200 birds dataset. To our knowledge this has not been previously accomplished, due the challenging nature of the dataset.

1804.05261 2026-05-14 cs.CV cs.GR

Physics-driven Fire Modeling from Multi-view Images

Gara Dorta, Luca Benedetti, Dmitry Kit, Yong-Liang Yang

发表机构 * University of Bath(巴斯大学)

AI总结 该研究提出了一种从多视角图像中重建物理合理的火焰模型的新方法,解决了传统火焰建模中依赖复杂物理模拟或简化假设的问题。通过RGB相机首次实现了对火焰体积物理属性(如温度、密度)的合理估计,从而支持全局火焰光照等新现象。该方法在多种输入数据上进行了验证,并成功应用于虚拟场景的真实光照生成,展示了其有效性与实用性。

详情
英文摘要

Fire effects are widely used in various computer graphics applications such as visual effects and video games. Modeling the shape and appearance of fire phenomenon is challenging as the underlying effects are driven by complex laws of physics. State-of-the-art fire modeling techniques rely on sophisticated physical simulations which require intensive parameter tuning, or use simplifications which produce physically invalid results. In this paper, we present a novel method of reconstructing physically valid fire models from multi-view stereo images. Our method, for the first time, provides plausible estimation of physical properties (e.g., temperature, density) of a fire volume using RGB cameras. This allows for a number of novel phenomena such as global fire illumination effects. The effectiveness and usefulness of our method are tested by generating fire models from a variety of input data, and applying the reconstructed fire models for realistic illumination of virtual scenes.

1307.7494 2026-05-14 cs.AI cs.LO cs.RO

ReAct! An Interactive Tool for Hybrid Planning in Robotics

Zeynep Dogmus, Esra Erdem, Volkan Patoglu

发表机构 * Sabancı University(Sabanci大学)

AI总结 本文介绍了一种名为 ReAct! 的交互式工具,用于机器人领域中的混合规划。该工具允许研究人员在无需了解底层形式化语法和语义细节的情况下,描述机器人在动态环境中的行为并解决规划问题。ReAct! 支持复杂动态域的建模,包括并发、动作的间接效应和状态/转换约束,并能够将外部计算(如碰撞自由轨迹检查)嵌入到混合域的表示中,从而实现离散高层推理与连续几何推理的紧密集成,适用于从服务机器人到认知工厂等多种复杂场景。

详情
英文摘要

We present ReAct!, an interactive tool for high-level reasoning for cognitive robotic applications. ReAct! enables robotic researchers to describe robots' actions and change in dynamic domains, without having to know about the syntactic and semantic details of the underlying formalism in advance, and solve planning problems using state-of-the-art automated reasoners, without having to learn about their input/output language or usage. In particular, ReAct! can be used to represent sophisticated dynamic domains that feature concurrency, indirect effects of actions, and state/transition constraints. It allows for embedding externally defined calculations (e.g., checking for collision-free continuous trajectories) into representations of hybrid domains that require a tight integration of (discrete) high-level reasoning with (continuous) geometric reasoning. ReAct! also enables users to solve planning problems that involve complex goals. Such variety of utilities are useful for robotic researchers to work on interesting and challenging domains, ranging from service robotics to cognitive factories. ReAct! provides sample formalizations of some action domains (e.g., multi-agent path planning, Tower of Hanoi), as well as dynamic simulations of plans computed by a state-of-the-art automated reasoner (e.g., a SAT solver or an ASP solver).

2605.13340 2026-05-14 cs.LG

Shortcut Mitigation via Spurious-Positive Samples

Phuong Quynh Le, Jörg Schlötterer, Sari Sadiya, Gemma Roig, Christin Seifert

发表机构 * University of Marburg(马尔堡大学) Goethe University Frankfurt(法兰克福歌德大学)

AI总结 该论文研究了如何缓解模型对虚假特征(spurious attributes)的依赖问题。作者提出了一种无需额外标注或平衡数据的方法,通过分析模型预测过程,识别出模型依赖虚假特征的样本,并据此定位中间层中与这些特征相关的神经元进行正则化。该方法有效提升了模型的鲁棒性,使其更依赖于真正的判别特征而非偶然正确的预测。

Comments preprint

详情
英文摘要

Shortcut mitigation strategies commonly rely on training data annotations, group-balanced held-out data or the presence of all groups, i.e., all combinations of (spurious) attributes and classes, in the training data. However, these requirements are rarely met in practice. We instead propose a method for targeted model analysis to identify a small set of instances in which the model relies on spurious attributes. Using that set and following ``this feature should not be used for prediction'' reasoning, we identify highly relevant neurons in an intermediate layer and regularize their impact. This ensures that models learn to depend on informative features rather than being right for the wrong reasons, thereby improving robustness without requiring additional balanced held-out data or annotations.

2605.13335 2026-05-14 cs.AI cs.CV

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

发表机构 * Xi’an Jiaotong University(西安交通大学) Nankai University(南开大学) National University of Singapore(新加坡国立大学) A*STAR

AI总结 本文提出 Ego2World,一个将第一视角烹饪视频编译为可执行符号世界的基准,用于评估具身智能体在部分可观测环境下的规划能力。该方法基于视频标注提取可复用的状态转移规则,并在隐藏的符号世界图中执行,迫使智能体仅依靠局部观测和执行反馈进行规划与记忆更新。实验表明,传统动作重叠度指标可能高估任务成功率,而维持持久的信念记忆有助于提升任务完成效率并减少重复视觉探索。

Comments Project page: https://sj-li.com/PROJ/Ego2World/

详情
英文摘要

Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

2605.13334 2026-05-14 cs.CL

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Rodrigo Nogueira, Thales Sales Almeida, Giovana Kerche Bonás, Andrea Roque, Ramon Pires, Hugo Abonizio, Thiago Laitz, Celio Larcher, Roseval Malaquias Junior, Marcos Piau

发表机构 * Maritaca AI JusBrasil

AI总结 该研究探讨了前沿大型语言模型(LLM)在面对敏感话题时的防护机制,并发现这些模型虽然直接拒绝生成争议性内容,但在模拟用户说服的对话中,却能被其他LLM成功引导生成此类内容。研究通过自然语言说服策略,如同行对比和认知责任重构,展示了攻击者LLM无需明确指令即可促使目标LLM突破其安全限制。实验表明,不同模型组合在多个科学共识话题上均能生成争议性文章,揭示了当前LLM安全机制在交互场景中的潜在漏洞。

详情
英文摘要

Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100\% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65\% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.