arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.01299 2026-05-15 cs.CV

Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals

Huan Ni, Qingshan Liu, Xiaonan Niu, Danfeng Hong, Lingli Zhao, Haiyan Guan

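AI summary: This paper studies cross-domain few-shot segmentation (CD-FSS), which aims to segment unseen categories from very few samples despite domain shift between the source and target domains. To address the limited knowledge flow caused by independent modules in existing methods, the authors propose FSS-TIs, a unified module based on ordinary differential equations (ODEs) and the Fourier transform. By evolving features over time intervals, it explores domain-agnostic features and learns efficiently from limited samples. Experiments show that the method outperforms existing approaches in both cross-domain adaptability and segmentation performance.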

Details
Abstract

Cross-domain few-shot segmentation (CD-FSS) aims to segment unseen categories with very limited samples while alleviating the negative effects of domain shift between the source and target domains. Existing CD-FSS studies typically rely on multiple independent modules to enhance cross-domain adaptability. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform, resulting in a structurally concise method, Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs not only explores a domain-agnostic feature space, but also achieves significant performance improvements through target-domain fine-tuning with extremely limited support samples. Specifically, the ODE modeling process incorporates nonlinear transformations and random perturbations of the amplitude and phase spectra, effectively simulating potential target-domain data distributions. Meanwhile, the analytical solution of the ODE is transformed into a theoretically infinitely iterable feature refinement process, thereby enhancing the learning capability under limited support samples. In this way, both the exploration of domain-agnostic features and the few-shot learning problem can be addressed through the optimization of the intrinsic parameters of the ODE. Moreover, during target-domain fine-tuning, we strictly constrain the support samples to match the settings of real-world CD-FSS tasks, without incurring additional annotation costs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.
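
The abstract mentions random perturbations of the amplitude and phase spectra. The sketch below illustrates only that generic idea on a 1D signal, using a naive DFT from the standard library; it is not the ODE formulation of FSS-TIs. The function names, the uniform noise model, and the scale parameters are assumptions made for illustration.

```python
import cmath
import random
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[complex]:
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def perturb_spectrum(signal: List[float], amp_scale: float, phase_scale: float,
                     rng: random.Random) -> List[float]:
    """Randomly jitter amplitude and phase of each frequency bin (assumed noise model)."""
    perturbed = []
    for c in dft(signal):
        amp = abs(c) * (1.0 + amp_scale * rng.uniform(-1.0, 1.0))
        phase = cmath.phase(c) + phase_scale * rng.uniform(-1.0, 1.0)
        perturbed.append(cmath.rect(amp, phase))
    # Independent jitter breaks conjugate symmetry, so only the real part is kept.
    return [v.real for v in idft(perturbed)]


signal = [1.0, 2.0, 0.5, -1.0]
print(perturb_spectrum(signal, amp_scale=0.1, phase_scale=0.05, rng=random.Random(0)))
```

With both scales set to zero the function reconstructs the input signal, which is a quick sanity check on the transform pair.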

2508.17588 2026-05-15 cs.CV

HERO: Hierarchical Extrapolation and Refresh for Efficient World Models

Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma

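AI summary: HERO is a training-free hierarchical acceleration framework for world models that targets the slow inference of diffusion-based world models. Exploiting the difference between shallow-layer and deep-layer feature representations that arises from the multi-modal nature of world models, it applies a patch-wise refresh mechanism to shallow layers and linear extrapolation to deeper layers. Experiments show a 1.73× speedup with minimal quality degradation, outperforming existing diffusion acceleration methods.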

Comments 12 pages in total

Details
Abstract

Generation-driven world models create immersive virtual environments but suffer from slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
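
The deep-layer strategy in (ii) can be pictured with a minimal sketch of linear extrapolation from two cached feature vectors. This is a generic illustration, not HERO's implementation: the function name, the use of exactly two cached steps, and the sample values are assumptions.

```python
from typing import List


def extrapolate_linear(prev: List[float], curr: List[float],
                       steps_ahead: int = 1) -> List[float]:
    """Estimate a future feature vector by extending the line through two cached ones."""
    if len(prev) != len(curr):
        raise ValueError("feature vectors must have the same length")
    return [c + steps_ahead * (c - p) for p, c in zip(prev, curr)]


# Hypothetical deep-layer features cached at two earlier denoising steps.
feat_two_steps_ago = [0.10, 0.50, -0.20]
feat_one_step_ago = [0.20, 0.40, -0.10]

# The estimate replaces a full attention + feed-forward pass for this step.
print(extrapolate_linear(feat_two_steps_ago, feat_one_step_ago))
```

The estimate costs one subtraction and one addition per feature element, which is why skipping the attention and feed-forward computations pays off when the features change smoothly between steps.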

2508.15198 2026-05-15 cs.LG math-ph math.MP

Frequency-adaptive tensor neural networks for high-dimensional multi-scale problems

Jizu Huang, Yue Qiu, Rukang You

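AI summary: Conventional tensor neural networks (TNNs) struggle to capture high-frequency features accurately in high-dimensional multi-scale problems, so this study proposes a frequency-adaptive TNN method. It uses Fourier analysis to characterize the training dynamics of TNNs and adds random Fourier features to increase their expressivity. It also exploits the tensor structure of TNNs by applying the discrete Fourier transform to the one-dimensional component functions, which mitigates the curse of dimensionality. The method markedly improves the ability of TNNs to solve complex multi-scale problems, and extensive numerical experiments validate its effectiveness and robustness.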

Details
Abstract

Tensor neural networks (TNNs) have demonstrated their superiority in solving high-dimensional problems. However, similar to conventional neural networks, TNNs are also influenced by the Frequency Principle, which limits their ability to accurately capture high-frequency features of the solution. In this work, we analyze the training dynamics of TNNs through Fourier analysis and enhance their expressivity for high-dimensional multi-scale problems by incorporating random Fourier features. Leveraging the inherent tensor structure of TNNs, we further propose a novel approach to extract frequency features of high-dimensional functions by applying the discrete Fourier transform to one-dimensional component functions. This strategy effectively mitigates the curse of dimensionality. Building on this idea, we propose a frequency-adaptive TNNs algorithm, which significantly improves the ability of TNNs to solve complex multi-scale problems. Extensive numerical experiments are performed to validate the effectiveness and robustness of the proposed frequency-adaptive TNNs algorithm.

2508.06226 2026-05-15 cs.AI

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Jun Liu

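AI summary: GeoLaux is a fine-grained benchmark dataset for evaluating multimodal large language models (MLLMs) on long-step geometry problems that require auxiliary line construction. It contains 2,186 calculation and proof problems with an average solution length of 6.51 steps, and 41.8% of the problems require auxiliary lines. A five-dimensional evaluation of 23 leading MLLMs finds that performance drops markedly on long-step problems, that weak understanding of auxiliary lines is a key factor limiting geometric reasoning, and that limited answer hints improve the correctness of the reasoning process. GeoLaux serves as a reference for evaluating and improving the geometric reasoning of MLLMs.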

Comments 26 pages, 24 figures

Details
Abstract

Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs' geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux

2508.06202 2026-05-15 cs.CV cs.AI

LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi

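AI summary: Continual visual instruction tuning (CVIT) lets multimodal large language models learn new tasks incrementally but suffers from catastrophic forgetting. To address this, the paper proposes LiLoRA, an efficient architecture expansion method that shares the LoRA matrix A across tasks and applies a low-rank decomposition to matrix B, which substantially reduces parameter overhead; a cosine-regularized stability loss keeps the shared representations consistent. Experiments show that LiLoRA achieves better performance on a diverse CVIT benchmark while improving parameter efficiency.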

Comments AAAI 2026 Oral Presentation. 9 pages

Details
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):19978--19986, 2026
Abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
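
To see why sharing A and factorizing B saves parameters, the following sketch counts trainable weights for one adapted layer. It is a back-of-the-envelope illustration only: the assumed factorization B_t = U_t V_t with inner rank k, the function names, and the layer sizes are hypothetical and not taken from the LiLoRA paper or code.

```python
def independent_lora_params(num_tasks: int, d_in: int, d_out: int, rank: int) -> int:
    """One full LoRA pair (A: rank x d_in, B: d_out x rank) per task."""
    return num_tasks * (rank * d_in + d_out * rank)


def shared_a_lowrank_b_params(num_tasks: int, d_in: int, d_out: int,
                              rank: int, inner_rank: int) -> int:
    """A shared across tasks; each task's B assumed to factor as
    U_t (d_out x inner_rank) times V_t (inner_rank x rank)."""
    shared_a = rank * d_in
    per_task_b = d_out * inner_rank + inner_rank * rank
    return shared_a + num_tasks * per_task_b


# Hypothetical layer and task settings.
tasks, d_in, d_out, rank, inner_rank = 6, 4096, 4096, 8, 2
print(independent_lora_params(tasks, d_in, d_out, rank))                 # 393216
print(shared_a_lowrank_b_params(tasks, d_in, d_out, rank, inner_rank))   # 82016
```

Under these assumed sizes the per-task cost drops from 65,536 to 8,208 parameters, so the total grows much more slowly as tasks are added.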

2508.05008 2026-05-15 cs.CV

Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song, Dong Yi, Jinlin Wu, Jiawei Ma, Hongbin Liu, Zhen Lei, Jiebo Luo

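AI summary: This work addresses domain shift in medical image segmentation caused by equipment differences, imaging modes, and similar factors, and proposes MCDRL, a multimodal causal-driven representation learning framework. The method combines a vision-language model with causal inference: it builds a dictionary of domain-specific confounders and trains a causal intervention network, removing domain bias while preserving anatomical structure information. Experiments show that MCDRL achieves higher segmentation accuracy and stronger cross-domain generalization across multiple medical image segmentation tasks.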

Comments Accepted by CVPR 2026

Details
Abstract

Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

2508.01916 2026-05-15 cs.LG cs.AI cs.CL

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Xinting Huang, Michael Hahn

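AI summary: This paper studies how to decompose a neural network's representation space into interpretable subspaces using unsupervised learning. The authors propose neighbor distance minimization (NDM), which learns subspaces aligned with the model's internal concepts without relying on labels. Experiments show that these subspaces capture abstract concepts in the input and are strongly connected to known circuit variables in models such as GPT-2, offering a new perspective on model internals.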

Comments Published as a conference paper at ICLR 2026

Details
Abstract

Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these ``natural'' subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to ``variables'' used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.

2507.21433 2026-05-15 cs.LG cs.AI

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Kaiwen Chen, Xin Tan, Minchen Yu, Jingzong Li, Hong Xu

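AI summary: Large reasoning models (LRMs) play a key role in many AI inference systems, but deploying them in production raises quality-of-service (QoS) challenges: long reasoning sequences incur high memory overhead, which limits throughput and increases latency. The paper proposes ReasonCache, a KV cache management approach based on a collaborative filtering algorithm that identifies the KV cache blocks of similar intermediate reasoning steps and reuses them without copying. Experiments show a peak throughput improvement of 89.2% and an average gain of 40-60% while maintaining higher accuracy than existing KV cache management techniques, making AI inference services more responsive and cost-effective.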

Comments 10 pages, 7 figures

Details
Abstract

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.

2507.21023 2026-05-15 cs.LG eess.SP

On Using the Shapley Value for Anomaly Localization: A Statistical Investigation

Xubin Fang, Rick S. Blum, Franziska Freytag

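AI summary: This paper examines the statistical properties of using the Shapley value for anomaly localization in sensor data systems. The authors show that using a single fixed term of the Shapley value calculation yields a lower-complexity anomaly localization test with the same probability of error. A proof establishes this for all independent-observation cases; for dependent observations no proof is available.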

Details
Journal ref
Applied AI Letters 7(2) (2026) e70024
Abstract

Recent publications have suggested using the Shapley value for anomaly localization for sensor data systems. Using a reasonable mathematical anomaly model for full control, experiments indicate that using a single fixed term in the Shapley value calculation achieves a lower complexity anomaly localization test, with the same probability of error, as a test using the Shapley value for all cases tested. A proof demonstrates these conclusions must be true for all independent observation cases. For dependent observation cases, no proof is available.
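
For readers unfamiliar with the quantity under discussion, here is a minimal sketch of the exact Shapley value computed from its standard definition, applied to a toy additive game. The sensor names, scores, and value function are hypothetical and do not reproduce the paper's anomaly model or test.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: weighted average of marginal contributions over all coalitions."""
    n = len(players)
    result = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        result[i] = total
    return result


# Hypothetical per-sensor anomaly scores; the coalition value is their sum.
scores = {"s1": 0.1, "s2": 2.5, "s3": 0.2}
phi = shapley_values(list(scores), lambda coalition: sum(scores[s] for s in coalition))
print(max(phi, key=phi.get))  # s2
```

The full computation evaluates 2^(n-1) coalitions per sensor, which is the cost a reduced test would avoid. In this additive toy game every marginal contribution of a sensor is identical, so its Shapley value equals its own score.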

2507.07776 2026-05-15 cs.CV

SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein, Iyiola E. Olatunji, Daniel Kudenko

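AI summary: The paper presents SCOOTER, an open-source framework for evaluating how authentic unrestricted adversarial examples look. Unrestricted adversarial attacks bypass traditional defenses through changes such as altering an object's color, but their imperceptibility can only be verified through human evaluation. SCOOTER provides a standardized human evaluation procedure, a large-scale comparison study, and open-source tools and datasets. It shows that several current attack methods fail to produce images that humans find imperceptible and highlights the gap between human perception and automated vision systems.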

Comments 42 pages, 16 figures, 11 tables, Under Review, Code: https://github.com/DrenFazlija/Scooter, Data: https://doi.org/10.5281/zenodo.15771501

Details
Abstract

Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.

2507.04049 2026-05-15 cs.CV cs.RO

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo

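AI summary: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, which leads to conservative, homogeneous behavior that adapts poorly to complex real-world scenarios. The paper proposes DIVER, a framework that combines reinforcement learning with diffusion-based generation to produce diverse and feasible driving trajectories. Reinforcement learning guides the diffusion process, with rewards enforcing safety and diversity, and the authors also propose a new diversity metric. Experiments on several benchmarks show markedly improved trajectory diversity, mitigating the mode collapse problem of imitation learning.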

Comments 17 pages, 10 figures

Details
Abstract

Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

2506.16608 2026-05-15 cs.LG cs.AI

Distributions as Actions: A Unified Framework for Diverse Action Spaces

Jiamin He, A. Rupam Mahmood, Martha White

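AI summary: This paper proposes a new reinforcement learning framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. The reparameterization makes the action space continuous, so it applies to discrete, continuous, and hybrid actions. The authors also propose a generalized deterministic policy gradient estimator, DA-PG, and a practical TD3-based actor-critic algorithm, DA-AC, which achieves competitive performance across a range of control tasks.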

Comments Accepted to ICLR 2026 (camera-ready)

Details
Abstract

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.

2506.08584 2026-05-15 cs.CL

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hsing-Chi Hwang, Ruishan Liu

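AI summary: This paper presents CounselBench, a large-scale benchmark developed with 100 mental health professionals for evaluating large language models on mental health question answering. It has two components. CounselBench-EVAL contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists, and reveals problems in clinical relevance, personalization, and safety. CounselBench-Adv uses expert-authored adversarial questions to expose model-specific failure modes. The work provides a clinically grounded framework for evaluating language models in mental health.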

Details
Abstract

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

2506.04499 2026-05-15 cs.CV

FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

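AI summary: This paper proposes FALO, a fast and accurate LiDAR 3D object detection method designed for resource-constrained edge devices. It arranges sparse voxels into a 1D sequence based on their coordinates and proximity and processes the sequence with the proposed ConvDotMix blocks, which provide sufficient feature mixing in both spatial and embedding dimensions along with higher-order nonlinear interaction. Experiments show that FALO is 1.6x to 9.8x faster than the latest state-of-the-art methods on mobile GPUs and NPUs while keeping competitive detection accuracy, making it suitable for compact embedded devices.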

Details
Abstract

Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms, and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6x to 9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

2506.00158 2026-05-15 cs.LG

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

Eli Chien, Wei-Ning Chen, Pan Li

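AI summary: This paper studies privacy amplification for zeroth-order fine-tuning of large language models under differential privacy (DP) and memory constraints. Because zeroth-order methods inject noise along random update directions, the resulting anisotropic noise does not fit standard privacy analysis frameworks. The authors propose a hybrid noise mechanism and a coupling analysis that yield the first convergent hidden-state DP bound, circumventing the global Lipschitz barrier. The result offers new theoretical support for designing better DP zeroth-order optimization algorithms.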

Comments ICML 2026

Details
Abstract

Zeroth-order optimization has emerged as a promising approach for fine-tuning large language models under differential privacy (DP) and memory constraints. While privacy amplification by iteration (PABI) provides convergent DP bounds for first-order methods, establishing similar guarantees for zeroth-order methods remains an open problem. First-order PABI analysis relies on the fact that gradients are perturbed with isotropic noise, allowing privacy bounds to be iteratively tracked via shifted Rényi divergence. In contrast, DP zeroth-order methods inject scalar noise along random update directions to maintain utility. This anisotropic update breaks standard shifted divergence frameworks, as the global Lipschitz property no longer holds almost surely. We provide the first convergent hidden-state DP bound for zeroth-order optimization by proposing a hybrid noise mechanism and a novel coupling analysis. We bypass the purely shifted-divergence approach by constructing a coupled auxiliary process, which circumvents the global Lipschitz barrier and yields a convergent privacy bound. Furthermore, our results induce better DP zeroth-order algorithmic designs that were previously unknown to the literature.
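
The phrase "scalar noise along random update directions" can be made concrete with a minimal sketch of one noisy two-point zeroth-order step. This illustrates the baseline style of update the abstract contrasts against, not the paper's hybrid noise mechanism; the function name, the clipping step, and all hyperparameter values are assumptions, and no privacy guarantee is implied by these numbers.

```python
import math
import random
from typing import Callable, List, Tuple


def dp_zo_step(params: List[float], loss: Callable[[List[float]], float],
               lr: float, mu: float, clip: float, sigma: float,
               rng: random.Random) -> Tuple[List[float], List[float]]:
    """One zeroth-order step with Gaussian noise added to the scalar directional estimate."""
    # Sample a random unit direction.
    u = [rng.gauss(0.0, 1.0) for _ in params]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    # Two-point finite-difference estimate of the directional derivative.
    plus = loss([p + mu * x for p, x in zip(params, u)])
    minus = loss([p - mu * x for p, x in zip(params, u)])
    g = (plus - minus) / (2.0 * mu)
    # Clip the scalar, then add scalar noise (illustrative sigma, not calibrated).
    g = max(-clip, min(clip, g))
    g_noisy = g + rng.gauss(0.0, sigma)
    # The whole update, noise included, lies along the single direction u.
    return [p - lr * g_noisy * x for p, x in zip(params, u)], u


start = [1.0, -2.0]
new_params, direction = dp_zo_step(start, lambda w: sum(x * x for x in w),
                                   lr=0.1, mu=1e-3, clip=1.0, sigma=0.5,
                                   rng=random.Random(0))
print(new_params)
```

Because the noise enters only through a scalar, the perturbation is confined to one direction per step rather than spread isotropically over all coordinates, which is the anisotropy the abstract refers to.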

2505.22394 2026-05-15 cs.CV

PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Fan Fei, Jiajun Tang, Fei-Peng Tian, Boxin Shi, Ping Tan

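AI summary: This paper presents PacTure, a new framework that generates physically based rendering (PBR) material textures for an untextured 3D mesh from a text description. To address the limited efficiency and texture consistency of existing methods, it introduces view packing, which raises the effective resolution of each view in multi-view generation while keeping the generative model efficient and compatible. Combined with fine-grained control in an autoregressive prediction framework, PacTure outperforms state-of-the-art methods in both quality and efficiency.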

Comments Accepted by Computational Visual Media Journal (CVMJ) in Feb. 2026. 19 pages, 7 figures

Details
Abstract

We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. Existing 2D generation-based texturing approaches either generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures, or adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation, without imposing additional inference cost. Unlike UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inference cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework, creating an efficient multi-view PBR generation backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality and efficiency.

2505.11809 2026-05-15 cs.CV

From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models

Zicheng Fan, Kunihiko Fujiwara, Pengyuan Liu, Fan Zhang, Filip Biljecki

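AI summary: This paper proposes a vision-language model (VLM) based method that uses street view imagery to assess how visible urban landmarks are in real streetscapes, as an alternative to traditional line-of-sight simulation based on geometric occlusion. Target landmarks are detected in direction- and zoom-controlled street view images, and a heterogeneous visibility graph is built to represent visual connections, revealing how multiple landmarks are linked through shared visual corridors. The method reaches 87% detection accuracy across several well-known landmarks and, in a case study along the River Thames in London, identifies key mediating locations, offering a new analytical perspective for urban planning and heritage conservation.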

Details
Abstract

Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.

2505.03519 2026-05-15 cs.LG

Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment

Sy-Tuyen Ho, Koh Jun Hao, Ngoc-Bao Nguyen, Alexander Binder, Ngai-Man Cheung

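AI summary: This paper revisits how model inversion (MI) attacks are evaluated and shows that the dominant evaluation framework is misleading: many attacks counted as successful are false positives that do not recover the target individual's identity. These false positives behave like Type I adversarial examples and transfer readily between models, which inflates reported attack accuracy. The authors propose a new evaluation framework based on multimodal large language models (MLLMs) that reduces adversarial transferability and assesses privacy leakage more reliably.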

Comments Accepted to CVPR Findings 2026

Details
Abstract

Model Inversion (MI) attacks aim to reconstruct information from private training data by exploiting access to a target model. Nearly all recent MI studies evaluate attack success using a standard framework that computes attack accuracy through a secondary evaluation model trained on the same private data and task design as the target model. In this paper, we present the first in-depth analysis of this dominant evaluation framework and reveal a fundamental issue: many reconstructions deemed successful under the existing framework are in fact false positives that do not capture the visual identity of the target individual. We first show that these MI false positives satisfy the same formal conditions as Type I adversarial examples. In our controlled experiments, we demonstrate extremely high false-positive transferability, an empirical signature characteristic of adversarial behavior, indicating that many MI false positives likely contain Type I adversarial features. This adversarial transferability significantly inflates reported attack accuracy and leads to an overstatement of privacy leakage in existing MI work. To address this issue, as our second contribution, we introduce a new evaluation framework based on MLLMs, whose general-purpose visual reasoning avoids the shared-task vulnerability and reduces the Type I adversarial transferability of the current evaluation framework. We propose systematic design principles for MLLM-based evaluation. Using this framework, we reassess 27 MI attack setups across diverse datasets, target models, and priors, and find consistently high false-positive rates under the conventional approach. Our results call for a reevaluation of progress in MI research and establish MLLM-based evaluation as a more reliable standard for assessing privacy risks in machine learning systems. Code/data/prompt are available at https://hosytuyen.github.io/projects/FMLLM

2505.01584 2026-05-15 cs.LG cs.AI

Silent Neuron Theory and Plasticity Preservation for Deep Reinforcement Learning in Adaptive Video Streaming

Zhiqiang He, Zhi Liu

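AI Summary This paper studies deep reinforcement learning for adaptive video streaming. To address the poor generalization caused by heterogeneous real-world network bandwidth, it proposes the "Silent Neuron theory", which characterizes plasticity degradation in neural networks more accurately. Building on this theory, the authors design the Reset Silent Neuron (ReSiN) method, which preserves network plasticity through strategic neuron resets that combine forward and backward propagation states, improving adaptation in non-stationary network environments. Experiments show that ReSiN significantly outperforms existing methods on bitrate and QoE metrics and remains robust across different network conditions.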

Details
English abstract

Adaptive video streaming optimizes Quality of Experience (QoE) metrics by selecting appropriate bitrates according to varying network bandwidth and user demands. In practice, however, real-world network bandwidth often exhibits heterogeneity relative to training environments. Current methods predominantly tackle this problem through learning-based approaches designed to improve generalization performance. However, our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to heterogeneous network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. Moreover, we establish a tighter performance bound for ReSiN under non-stationary network conditions. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better QoE while maintaining comparable smoothness. Furthermore, ReSiN also consistently outperforms existing solutions in stationary environments, demonstrating its robust adaptability across different network conditions.

2504.18544 2026-05-15 cs.LG cs.AI cs.CY

Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review

Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani

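AI Summary This paper systematically reviews recent research on generating and evaluating synthetic tabular health data. It identifies key challenges in current practice, including a lack of consensus on evaluation methods, inconsistent application of metrics, and limited involvement of domain experts. In response, it proposes structured taxonomies and practical evaluation guidelines intended to promote more rigorous, standardised evaluation and the responsible development and use of synthetic health data.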

Comments 32 pages

Details
English abstract

Generating synthetic tabular health data is challenging, and evaluating their quality is equally, if not more, complex. This systematic review highlights the critical importance of rigorous evaluation of synthetic health data to ensure reliability, clinical relevance, and appropriate use. From an initial identification of 2067 relevant papers published in the last ten years, 134 studies were selected for detailed analysis. Our review identifies key challenges, including lack of consensus on evaluation methods, inconsistent application of evaluation metrics, limited involvement of domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide a structured consolidation of synthetic data generation and evaluation methods into taxonomies, alongside practical guidelines to support more robust and standardised evaluation practices. These findings aim to support the responsible development and use of synthetic health data, aligned with emerging expectations around transparency, reproducibility, and governance, ultimately enabling the community to fully harness its transformative potential and accelerate innovation.

2504.09549 2026-05-15 cs.CV

SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Yuhao Wang, Xiang Hu, Lixin Wang, Pingping Zhang, Huchuan Lu

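AI Summary This paper proposes SD-ReID, a generative framework for aerial-ground person re-identification (AG-ReID). The method combines a generative model with controllable conditions to learn the feature distributions of different views and thereby extract more robust identity representations, and it introduces a View-Refined Decoder to strengthen feature alignment. Experiments show strong performance on multiple AG-ReID datasets.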

Comments This work has been accepted by IEEE TIP 2026. Further modifications may be made

Details
English abstract

Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code and pre-trained models are available at https://github.com/924973292/SD-ReID.

2504.07738 2026-05-15 cs.CL

Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

Andrea Loreti, Kesi Chen, Ruby George, Robert Firth, Adriano Agnello, Shinnosuke Tanaka

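AI Summary This paper presents a multi-step approach to automatically constructing a knowledge graph for nuclear fusion energy, in order to organise and represent specialised knowledge from large document corpora. It focuses on using pre-trained large language models for automatic named entity recognition and entity resolution, and evaluates their performance against Zipf's law. The authors also develop a knowledge-graph-based retrieval-augmented generation system that uses multiple prompts to give contextually relevant answers to natural-language queries, particularly complex questions that require reasoning across entities.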

Details
English abstract

In this document, we discuss a multi-step approach to automated construction of a knowledge graph, for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf's law, which characterizes human natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that uses multiple prompts with large language models to provide contextually relevant answers to natural-language queries, including complex multi-hop questions requiring reasoning across interconnected entities.
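
Illustrative sketch (not from the paper): the abstract does not say how the comparison with Zipf's law is carried out. One common check, assumed here, is to rank items by frequency and fit the slope of log(frequency) against log(rank); a slope near -1 is the classic Zipfian signature. The function name `zipf_slope` and the synthetic entity tokens are hypothetical.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic "entity mentions": the entity at rank r appears 120 // r times,
# so the frequencies follow an exact 1/rank law.
tokens = []
for rank in range(1, 7):
    tokens += [f"entity_{rank}"] * (120 // rank)

slope = zipf_slope(tokens)
print(f"fitted slope: {slope:.3f}")
```

On real extracted entities the fit would be noisier, and how far the slope departs from -1 is the quantity of interest.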

2504.01527 2026-05-15 cs.CV eess.IV

Beyond Nearest Neighbor Interpolation in Data Augmentation

Olivier Rukundo

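AI Summary This paper examines problems that nearest neighbor interpolation can cause during data augmentation, including aggravating pixel-level annotation errors and degrading high-frequency structural detail in regions of interest. The author proposes modified data transformation functions for convolutional neural networks, introducing a modified geometric transformation function together with a mean-based class filtering mechanism, to replace nearest neighbor interpolation and handle undefined categorical labels. Experiments show improved model performance on several medical image segmentation datasets.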

Comments 10 pages, 11 figures, 14 tables

Details
English abstract

Avoiding the risk of undefined categorical labels by using nearest neighbor interpolation overlooks the risk of exacerbating pixel-level annotation errors in augmented training data. Additionally, the inherent low-pass filtering effects of interpolation algorithms exacerbate the risk of degrading high-frequency structural details within annotated regions of interest. To avoid these risks, the author modified the data transformation functions used with convolutional neural networks by incorporating a modified geometric transformation function, removing reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation-specific augmented training data, enabling quantitative assessment of interpolation-specific low-pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ datasets demonstrated performance gains across multiple quantitative metrics.
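
Illustrative sketch (not the author's implementation): the "undefined categorical labels" problem can be seen on a one-dimensional label row. Nearest neighbor resizing only copies existing labels, whereas linear interpolation blends neighbouring labels into values that belong to no class. The helper names and the toy label row are assumptions made for this example, and the paper's mean-based class filtering is not reproduced here.

```python
def resize_nearest(row, new_len):
    """Upsample a 1-D label row by copying the nearest source pixel."""
    scale = len(row) / new_len
    return [row[min(int((i + 0.5) * scale), len(row) - 1)] for i in range(new_len)]


def resize_linear(row, new_len):
    """Upsample a 1-D label row by linear interpolation (pixel-centre aligned)."""
    scale = len(row) / new_len
    out = []
    for i in range(new_len):
        # Source position of output pixel i, clamped to the valid range.
        pos = min(max((i + 0.5) * scale - 0.5, 0.0), len(row) - 1.0)
        left = int(pos)
        right = min(left + 1, len(row) - 1)
        frac = pos - left
        out.append((1 - frac) * row[left] + frac * row[right])
    return out


labels = [0, 0, 3, 3]  # two classes: background (0) and class 3
classes = set(labels)

nearest = resize_nearest(labels, 8)
linear = resize_linear(labels, 8)
undefined = [v for v in linear if v not in classes]

print(nearest)    # only labels 0 and 3 appear
print(undefined)  # blended values at the class boundary
```

Any non-nearest interpolation applied to label masks therefore needs a post-processing step that maps such blended values back to valid classes.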

2502.17347 2026-05-15 cs.RO

SoFFT: Spatial Fourier Transform for Modeling Continuum Soft Robots

Daniele Caradonna, Diego Bianchi, Franco Angelini, Egidio Falotico

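AI Summary This paper proposes SoFFT, a modeling method based on the spatial Fourier transform for describing the deformation of continuum soft robots. It treats the robot's backbone as a signal in space and time and uses the Fourier transform to represent it compactly, reducing the degrees of freedom while preserving deformation accuracy. The method unifies existing modeling strategies within Cosserat Rod Theory and also yields a data-driven way to capture deformation experimentally; it is validated in numerical simulations and experiments on a physical prototype.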

Details
English abstract

Continuum soft robots, composed of flexible materials, exhibit theoretically infinite degrees of freedom, enabling notable adaptability in unstructured environments. Cosserat Rod Theory has emerged as a prominent framework for modeling these robots efficiently, representing continuum soft robots as time-varying curves, known as backbones. In this work, we propose viewing the robot's backbone as a signal in space and time, applying the Fourier transform to describe its deformation compactly. This approach unifies existing modeling strategies within the Cosserat Rod Theory framework, offering insights into commonly used heuristic methods. Moreover, the Fourier transform enables the development of a data-driven methodology to experimentally capture the robot's deformation. The proposed approach is validated through numerical simulations and experiments on a real-world prototype, demonstrating a reduction in the degrees of freedom while preserving the accuracy of the deformation representation.
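
Illustrative sketch (not the SoFFT formulation): the general idea of describing a deformation profile with a few Fourier coefficients can be shown on a hypothetical curvature signal sampled along the backbone. The signal, the sample count, and the periodicity assumption are all choices made for this example; the paper's actual strain parametrization is not given in the abstract.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def truncate(spectrum, keep):
    """Zero every mode whose frequency index exceeds `keep` (both signs)."""
    n = len(spectrum)
    return [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


# Hypothetical curvature profile sampled at 32 arc-length points (treated as periodic).
n = 32
curvature = [1.0 + 0.5 * math.cos(2 * math.pi * s / n) + 0.2 * math.sin(4 * math.pi * s / n)
             for s in range(n)]

spectrum = truncate(dft(curvature), keep=2)
coefficients = sum(1 for c in spectrum if abs(c) > 1e-9)
reconstruction = idft(spectrum)
max_error = max(abs(a - b) for a, b in zip(curvature, reconstruction))

print(coefficients, "nonzero coefficients instead of", n, "samples")
print("max reconstruction error:", max_error)
```

When the deformation is smooth, a small number of low-frequency modes carries almost all of the shape information, which is the sense in which the degrees of freedom can be reduced.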

2502.09198 2026-05-15 cs.LG

Understanding High-Dimensional Bayesian Optimization

Leonard Papenmeier, Matthias Poloczek, Luigi Nardi

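AI Summary This paper investigates why simple Bayesian optimization methods perform well on high-dimensional real-world tasks, seemingly contradicting earlier findings. It identifies key challenges in high-dimensional Bayesian optimization, with vanishing gradients caused by Gaussian process initialization being a major factor in poor performance. The authors determine Gaussian process length scales by maximum likelihood estimation and, building on this, design a simple and effective method, MSR, that reaches state-of-the-art performance on a range of real-world applications.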

Comments 22 pages, 21 figures. Accepted to ICML 2025

Details
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47902-47923, 2025
English abstract

Recent work reported that simple Bayesian optimization (BO) methods perform well for high-dimensional real-world tasks, seemingly contradicting prior work and tribal knowledge. This paper investigates why. We identify underlying challenges that arise in high-dimensional BO and explain why recent methods succeed. Our empirical analysis shows that vanishing gradients caused by Gaussian process (GP) initialization schemes play a major role in the failures of high-dimensional Bayesian optimization (HDBO) and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation (MLE) of GP length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of MLE called MSR that leverages these findings to achieve state-of-the-art performance on a comprehensive set of real-world applications. We present targeted experiments to illustrate and confirm our findings.
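
Illustrative sketch (not the paper's MSR method): maximum likelihood estimation of a GP length scale means choosing the length scale that maximizes the log marginal likelihood of the observed data. The one-dimensional RBF kernel, the fixed noise level, the toy data, and the grid search below are all assumptions made to keep the example small; practical implementations use gradient-based optimizers.

```python
import math


def rbf(a, b, lengthscale):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower


def log_marginal_likelihood(xs, ys, lengthscale, noise=1e-2):
    """log p(y | x, lengthscale) for a zero-mean GP with RBF kernel plus noise."""
    n = len(xs)
    k = [[rbf(xs[i], xs[j], lengthscale) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    lower = cholesky(k)
    # Solve L z = y by forward substitution; then y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(lower[i][j] * z[j] for j in range(i))) / lower[i][i])
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * sum(v * v for v in z) - 0.5 * log_det - 0.5 * n * math.log(2 * math.pi)


# Toy data and a coarse grid of candidate length scales (both hypothetical).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [math.sin(2 * x) for x in xs]
grid = [0.1, 0.3, 1.0, 3.0, 10.0]
best = max(grid, key=lambda ls: log_marginal_likelihood(xs, ys, ls))
print("length scale with highest marginal likelihood:", best)
```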

2502.08208 2026-05-15 cs.LG

Exploring Exploration in Bayesian Optimization

Leonard Papenmeier, Nuojin Cheng, Stephen Becker, Luigi Nardi

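AI Summary In Bayesian optimization, balancing exploration and exploitation is crucial to the performance of acquisition functions. This paper proposes two new measures, observation traveling salesman distance and observation entropy, to quantify the exploration characteristics of acquisition functions. Using these measures, the study analyzes the exploration behavior of several classic acquisition functions across different black-box problems, links exploration to empirical performance, and uncovers new relationships among existing acquisition functions, offering more systematic and principled guidance for acquisition function design.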

Comments 28 pages, 34 figures

Details
Journal ref
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:3388-3415, 2025
English abstract

A well-balanced exploration-exploitation trade-off is crucial for successful acquisition functions in Bayesian optimization. However, there is a lack of quantitative measures for exploration, making it difficult to analyze and compare different acquisition functions. This work introduces two novel approaches, observation traveling salesman distance and observation entropy, to quantify the exploration characteristics of acquisition functions based on their selected observations. Using these measures, we examine the explorative nature of several well-known acquisition functions across a diverse set of black-box problems, uncover links between exploration and empirical performance, and reveal new relationships among existing acquisition functions. Beyond enabling a deeper understanding of acquisition functions, these measures also provide a foundation for guiding their design in a more principled and systematic manner.
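
Illustrative sketch (not the paper's definitions): the abstract names the two measures but does not define them. As rough stand-ins, the code below uses the length of a greedy nearest-neighbour path through the observations as a proxy for a traveling-salesman distance, and the Shannon entropy of a grid histogram as a proxy for observation entropy. Both estimators, the function names, and the toy point sets are assumptions.

```python
import math


def greedy_tour_length(points):
    """Length of a nearest-neighbour path visiting every point once (a crude TSP proxy)."""
    remaining = list(points[1:])
    current, total = points[0], 0.0
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        total += math.dist(current, nxt)
        remaining.remove(nxt)
        current = nxt
    return total


def histogram_entropy(points, bins=4):
    """Shannon entropy (nats) of points in [0, 1]^d binned on a regular grid."""
    counts = {}
    for p in points:
        cell = tuple(min(int(x * bins), bins - 1) for x in p)
        counts[cell] = counts.get(cell, 0) + 1
    n = len(points)
    return -sum(c / n * math.log(c / n) for c in counts.values())


# Two hypothetical observation sets in the unit square.
exploratory = [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9)]
exploitative = [(0.50, 0.50), (0.52, 0.50), (0.52, 0.52), (0.50, 0.52)]

for name, pts in (("exploratory", exploratory), ("exploitative", exploitative)):
    print(name, greedy_tour_length(pts), histogram_entropy(pts))
```

Observations spread across the domain give a longer path and higher entropy than observations clustered around one point, which is the intuition behind both measures.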

2409.10038 2026-05-15 cs.CL cs.AI cs.LG

On the Diagram of Thought

Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

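AI Summary Large language models (LLMs) perform well on many tasks but struggle with complex problems that require structured, multi-step reasoning. This paper proposes the Diagram of Thought (DoT), a framework that lets a single LLM build and navigate a mental map of its reasoning: by dynamically constructing a diagram of ideas, the model can propose different lines of reasoning, critique itself, and synthesize validated insights into a final conclusion. The method needs no external search algorithm or planner, relying only on a deterministic online validator, and is grounded in a mathematical framework from category theory that provides an auditable step-by-step trace and semantic guarantees for the LLM's structured reasoning.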

Comments 30 pages

Details
English abstract

Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning. We introduce the Diagram of Thought (DoT), a framework that enables a single LLM to build and navigate a mental map of its reasoning. Instead of thinking in a straight line, the model constructs a dynamic diagram of ideas, where it can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. This process is controller-light: it does not require an external search algorithm or planner, but it does use a deterministic online validator for grammar-constrained typed traces, register constraints, and optional solver checks. To clarify the reliability target of this process, we ground DoT in a mathematical framework from category theory. We interpret accepted typed reasoning records as diagrams in a slice topos and model synthesis of the selected proposer subdiagram as a finite limit. In the predicate fragment, this same object is equivalently a variance-reversed colimit in the opposite information order. The resulting formalism gives an auditable, step-by-step trace of the LLM's typed reasoning and separates semantic guarantees for the typed subtrace from unconstrained natural-language text and uncertified operational edges.

2408.16307 2026-05-15 cs.RO cs.AI

Safe Bayesian Optimization for Complex Control Systems via Additive Gaussian Processes

Hongxuan Wang, Xiaocong Li, Lihao Zheng, Adrish Bhaumik, Prahlad Vadakkepat

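AI Summary This paper proposes SafeCtrlBO, a safe Bayesian optimization method for simultaneously tuning the parameters of multiple coupled controllers, addressing safe optimization of complex control systems. The method uses additive Gaussian process kernels to capture low-order structure across controller gains and thereby reduce sample complexity, and it replaces a computationally expensive step of conventional methods with a boundary-based expansion rule so that safety constraints hold in hardware experiments. Experiments show that SafeCtrlBO reaches high-performing controller parameters with fewer hardware evaluations while maintaining high-probability safety and satisfying the hard signal-safety constraint.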

Comments The shorter version has been accepted by IEEE Robotics and Automation Letters. This is the full version

Details
English abstract

Automatic controller tuning is attractive for robotics and mechatronic systems whose dynamics are difficult to model accurately, but direct black-box optimization can be unsafe because each query is executed on the physical plant. Existing safe Bayesian optimization (BO) methods provide high-probability safety guarantees, yet their practical use in multi-loop control is limited by two coupled difficulties: the controller parameter space is often moderately high-dimensional, and hardware evaluations are too expensive to allow hundreds or thousands of exploratory trials. This paper proposes SafeCtrlBO, a safe BO method for simultaneously tuning multiple coupled controllers. The method uses additive Gaussian-process kernels to encode low-order structure across controller gains and reduce the sample complexity associated with dense full-dimensional kernels. It also replaces the expensive potential-expander computation used in SafeOpt-style exploration with a boundary-based expansion rule that preserves the intended safe-set expansion behavior under explicit geometric conditions and is validated empirically. Experiments on synthetic benchmarks and on a permanent magnet synchronous motor (PMSM) speed-control platform show that SafeCtrlBO reaches high-performing controller parameters with fewer hardware evaluations than representative safe BO baselines, while maintaining the prescribed high-probability safety criterion and avoiding violations of the hard signal-safety constraint in the hardware study. The code implementation is publicly available at https://github.com/hxwangnus/SafeCtrlBO.
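
Illustrative sketch (not the SafeCtrlBO code): an additive kernel sums low-dimensional kernels instead of multiplying them into one dense kernel. The first-order form below, with one RBF term per controller gain, is an assumption chosen for simplicity; the abstract says only that the kernels encode low-order structure. The gain vectors and length scales are hypothetical.

```python
import math


def rbf_1d(a, b, lengthscale):
    """Squared-exponential kernel on a single coordinate."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def additive_kernel(x, y, lengthscales):
    """First-order additive kernel: one 1-D RBF term per controller gain."""
    return sum(rbf_1d(a, b, ls) for a, b, ls in zip(x, y, lengthscales))


def full_kernel(x, y, lengthscales):
    """Dense product (ARD) RBF kernel over all gains jointly, for comparison."""
    return math.prod(rbf_1d(a, b, ls) for a, b, ls in zip(x, y, lengthscales))


# Two hypothetical gain vectors that differ only in the first gain.
x = (1.0, 0.5, 0.1)
y = (5.0, 0.5, 0.1)
ls = (1.0, 1.0, 1.0)

print("additive:", additive_kernel(x, y, ls))
print("full:    ", full_kernel(x, y, ls))
```

Under the additive kernel the two settings remain strongly correlated because they agree on two of three gains, so an evaluation at one still informs the other. Under the dense kernel the large change in a single gain makes them nearly independent, which is one way to see why dense kernels need more samples.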

2405.07459 2026-05-15 cs.CV

DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search

Yuchuan Deng, Zhanpeng Hu, Zijie Xin, Chuang Deng, Qijun Zhao

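AI Summary This paper studies how to integrate positive and negative descriptions effectively in text-based person search (TBPS). Existing methods focus mainly on positive attributes and neglect negative descriptions, which can cause false positives. The authors propose the DAPL framework, which combines positive and negative descriptions and introduces Dual Image-Attribute Contrastive learning and Sensitive Image-Attribute Matching learning to improve recognition of unseen attributes; it also adds a Dynamic Token-wise Similarity loss to refine the alignment of visual and textual embeddings, markedly improving accuracy and robustness on TBPS tasks.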

Details
Journal ref
2025 IEEE International Conference on Multimedia and Expo (ICME)
English abstract

Text-based person search (TBPS) aims to retrieve specific images of individuals from large datasets using textual descriptions. Existing TBPS methods focus primarily on identifying explicit positive attributes, often neglecting the critical role of negative descriptions. This oversight can lead to false positives, where images that should be excluded based on negative descriptions are incorrectly included, due to partial alignment with the positive criteria. To address this limitation, we propose the Dual Attribute Prompt Learning (DAPL) framework, which incorporates both positive and negative descriptions to improve the interpretative accuracy of vision-language models in TBPS tasks. DAPL combines Dual Image-Attribute Contrastive (DIAC) learning with Sensitive Image-Attribute Matching (SIAM) learning to enhance the detection of previously unseen attributes. Furthermore, to achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we introduce the Dynamic Token-wise Similarity (DTS) loss. This loss function refines the representation of both matching and non-matching descriptions at the token level, providing more precise and adaptable similarity assessments, and ultimately improving the accuracy of the matching process. Empirical results demonstrate that DAPL outperforms state-of-the-art methods, enhancing both precision and robustness in TBPS tasks.

2304.11468 2026-05-15 cs.LG stat.ML

Increasing the Scope as You Learn: Adaptive Bayesian Optimization in Nested Subspaces

Leonard Papenmeier, Luigi Nardi, Matthias Poloczek

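AI Summary This paper proposes BAxUS, an adaptive Bayesian optimization method that introduces nested random subspaces and adapts the search space during optimization, to counter the performance degradation seen when optimizing high-dimensional black-box functions. The method is backed by theoretical guarantees and achieves better optimization results than state-of-the-art methods on a range of application tasks.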

Comments 28 pages, 8 figures. Accepted to NeurIPS 2022. This is the revised version and includes the appendix

Details
Journal ref
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pp. 11586-11601
English abstract

Recent advances have extended the scope of Bayesian optimization (BO) to expensive-to-evaluate black-box functions with dozens of dimensions, aspiring to unlock impactful applications, for example, in the life sciences, neural architecture search, and robotics. However, a closer examination reveals that the state-of-the-art methods for high-dimensional Bayesian optimization (HDBO) suffer from degrading performance as the number of dimensions increases or even risk failure if certain unverifiable assumptions are not met. This paper proposes BAxUS, which leverages a novel family of nested random subspaces to adapt the space it optimizes over to the problem. This ensures high performance while removing the risk of failure, which we assert via theoretical guarantees. A comprehensive evaluation demonstrates that BAxUS achieves better results than the state-of-the-art methods for a broad set of applications.
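
Illustrative sketch (not the BAxUS implementation): one way to obtain nested random subspaces, assumed here, is a sparse embedding that assigns each input dimension to a single low-dimensional coordinate with a random sign. Splitting a low-dimensional coordinate in two and copying its value then leaves every earlier point unchanged in the full input space. The function names, the dimensions, and the splitting rule are hypothetical.

```python
import random


def random_embedding(input_dim, target_dim, rng):
    """Assign each input dimension to one target dimension with a random sign."""
    return [(rng.randrange(target_dim), rng.choice((-1, 1))) for _ in range(input_dim)]


def project_up(y, embedding):
    """Map a low-dimensional point y to the full input space."""
    return [sign * y[bin_] for bin_, sign in embedding]


def split_bin(embedding, y, bin_to_split, rng):
    """Split one target dimension in two, keeping earlier points representable."""
    new_bin = len(y)
    members = [i for i, (b, _) in enumerate(embedding) if b == bin_to_split]
    moved = set(rng.sample(members, len(members) // 2))
    new_embedding = [(new_bin if i in moved else b, s) for i, (b, s) in enumerate(embedding)]
    new_y = y + [y[bin_to_split]]  # the new coordinate copies the old value
    return new_embedding, new_y


rng = random.Random(0)
embedding = random_embedding(input_dim=10, target_dim=2, rng=rng)
y = [0.3, -0.7]
before = project_up(y, embedding)

embedding2, y2 = split_bin(embedding, y, bin_to_split=0, rng=rng)
after = project_up(y2, embedding2)

print(before == after)  # the observation maps to the same full-dimensional point
```

Because earlier observations stay valid after each split, the optimizer can start in a very low-dimensional subspace and grow it without discarding data.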